Caches aplenty

Thursday 4 September 2008

My laptop has a 100 GB drive, and recently it was 98% or so full! As part of the job of cleaning it up, I used SpaceMonger to see where the space was going. I noticed a few largish directories whose names indicated they were caches of some sort, and wondered how much disk was being lost to copies of files that I didn’t really need to keep around.

I cobbled together this Python script to recursively list the size of folders and files, but only if they exceed specified minimums:

""" List file sizes recursively, but only if they exceed
    certain minimums.
"""

import stat, os

# Minimum size (in bytes) for a file or directory to be listed.
min_file = 10000000
min_dir = 1000000

format = "%15d   %s"
dir_format = "%15d / %s"
err_format = "            !!! ! %s"

def do_dir(d):
    """ Process a single directory, return its total size,
        and print intermediate results along the way.
    """

    try:
        files = os.listdir(d)
    except Exception as e:
        # Unreadable directory: report it and count it as empty.
        print(err_format % str(e))
        return 0

    files.sort()
    total = 0

    for f in files:
        f = os.path.join(d, f)
        st = os.stat(f)
        size = st[stat.ST_SIZE]
        is_dir = stat.S_ISDIR(st[stat.ST_MODE])
        if is_dir:
            # A directory's size is the total of everything beneath it.
            size = do_dir(f)
        else:
            if size >= min_file:
                print(format % (size, f))
        total += size

    if total >= min_dir:
        print(dir_format % (total, d))

    return total

if __name__ == '__main__':
    do_dir(".")

Running this on my disk, and grep’ing for “cache”, I came up with this list of cache directories:

       77428736 / .\Documents and Settings\All Users\Application Data\Apple\Installer Cache
      193088296 / .\Documents and Settings\All Users\Application Data\Apple Computer\Installer Cache
      127431856 / .\Documents and Settings\All Users\Application Data\Symantec\Cached Installs
        1283586 / .\Documents and Settings\All Users\DRM\Cache
        8904444 / .\Documents and Settings\batcheln\Application Data\Adobe\CameraRaw\Cache
        3109555 / .\Documents and Settings\batcheln\Application Data\Dropbox\cache
        9141658 / .\Documents and Settings\batcheln\Application Data\Microsoft\CryptnetUrlCache
        6639905 / .\Documents and Settings\batcheln\Application Data\Sun\Java\Deployment\cache
      244047364 / .\Documents and Settings\batcheln\Local Settings\Application Data\Adobe\CameraRaw\Cache
       35706839 / .\Documents and Settings\batcheln\Local Settings\Application Data\Mozilla\Firefox\Profiles\0ou4abpz.default\Cache
        1559441 / .\Documents and Settings\batcheln\Local Settings\Application Data\johnsadventures.com\Background Switcher\FolderQuarterScreenCache
      381984768   .\Documents and Settings\batcheln\My Documents\My Pictures\Lightroom\Lightroom Catalog Previews.lrdata\thumbnail-cache.db
       44671279 / .\Program Files\Adobe\Adobe Help Center\AdobeHelpData\Cache
        1093120 / .\Program Files\Common Files\Microsoft Shared\SFPCA Cache
     1139888470 / .\Program Files\Cyan Worlds\Myst Uru Complete Chronicles\sfx\streamingCache
       73237698 / .\Program Files\Hewlett-Packard\PC COE 3\OV CMS\Lib\Cache
       46559334 / .\WINDOWS\assembly\GAC
       20606686 / .\WINDOWS\assembly\GAC_32
       55143608 / .\WINDOWS\assembly\GAC_MSIL
      105975390 / .\WINDOWS\Driver Cache
       96353450 / .\WINDOWS\Installer\$PatchCache$
        1898024 / .\WINDOWS\SchCache
        1174871 / .\WINDOWS\pchealth\helpctr\OfflineCache
      451465998 / .\WINDOWS\system32\dllcache

(I also included the GAC directories: the .NET Global Assembly Caches.) Summing these sizes, I see that about 3.1 GB of space is occupied by self-declared caches. For many of these I don’t know whether it is safe to delete them. Luckily the largest was a game I installed for Max and could completely uninstall.
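
The filtering and summing can also be done in Python rather than with grep and a calculator. What follows is only a minimal sketch: `LINE_RE` and `sum_matching` are names invented for illustration, and it assumes the input lines are in the format the script above prints (a size, an optional "/" marking a directory, then the path).

```python
import re

# A size, an optional "/" marking a directory, then the path.
# Error lines ("!!! ! ...") do not start with a number and will not match.
LINE_RE = re.compile(r"^\s*(\d+)\s+(/\s+)?(.*)$")

def sum_matching(lines, keywords=("cache",)):
    """Sum the sizes on listing lines whose path contains any keyword,
    ignoring case. Lines that don't parse are skipped."""
    total = 0
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue
        size = int(m.group(1))
        path = m.group(3).lower()
        if any(k.lower() in path for k in keywords):
            total += size
    return total

# Three lines taken from the listing above.
sample = [
    r"        1283586 / .\Documents and Settings\All Users\DRM\Cache",
    r"       46559334 / .\WINDOWS\assembly\GAC",
    r"        1898024 / .\WINDOWS\SchCache",
]
print(sum_matching(sample, keywords=("cache", "gac")))
```

One caveat: if a matching directory sits inside another matching directory, both lines are counted and the total is inflated, so the result is worth a glance before trusting it.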

Windows provides the Disk Cleanup utility, which knows how to get rid of a bunch of stuff you don’t really need. Application developers can even write a handler to clean up their own unneeded files, but it seems few of them do: I don’t have any custom handlers on my machine.

CCleaner is a Windows utility to scrub a little harder at your disk, but even it missed some of these folders: for example, it removed the smaller of the CameraRaw caches (8 MB), but left the larger (244 MB). I read online that CameraRaw really doesn’t need those files, so I removed them by hand.

I’m all for applications making use of disk space to improve the user experience, but they should do it responsibly: give me a way to see what’s being used, and give me a way to get it back. And only keep what makes sense: why do my Apple Installer Cache directories have kits for three different versions each of iTunes, QuickTime, and Safari, and seven kits for Apple Mobile Device Support? Why do I need to keep installers for versions that have already been superseded?


Comments

Charles Merriam 10:22 PM on 4 Sep 2008

If you can dual-boot to a different operating system try:
$ sudo apt-get install filelight
$ filelight

Filelight provides a quick graphical view of where your space is actually used. Pretty pictures.

Foone 8:11 AM on 5 Sep 2008

Nice script, very handy.
One bugfix, though: On systems with symlinks, os.stat can throw a file-not-found OSError for a filename you got back from os.listdir.
I stuck in a try:except:pass block and the script worked fine. (This'd also help protect against the case of a file disappearing between when you listdir it and when you stat it)

George Reilly 4:20 PM on 5 Sep 2008

There are lots of free disk space visualizers with treemaps: http://lifehacker.com/software/disk-space/geek-to-live--visualize-your-hard-drive-usage-219058.php

I've been very happy with WinDirStat and Disk Inventory X (OS X).

_Mark_ 12:33 PM on 6 Sep 2008

Not to excessively critique an admittedly personal-use "cobbled together" script (it does *work* after all :-) but "writing C code in Python" is a pattern I've seen among coworkers too. The script would be a bit more readable (and probably more writable) if you'd used os.path.getsize and os.path.isdir, instead of the more brutal C "stat" equivalents... you might also consider os.walk for the iteration. (In fact I just noticed, at least in python2.5, that "pydoc os.walk" includes a fairly elegant version of the above script...)

Ned Batchelder 1:17 PM on 6 Sep 2008

Mark, the critique is welcome. To be honest, none of that os.* file stuff sticks in my head, and every time I need a script like this I go find the last similar one I wrote, and copy and paste excessively. The os.path functions you mention are of course better than the os.stat stuff. I need to concentrate harder on os.walk to know if it iterates in the way I want.
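
For what it's worth, here is a rough sketch of how os.walk could handle this kind of traversal. With topdown=False it yields each directory only after its subdirectories, so child totals are available by the time the parent is reached. The name `dir_sizes` is hypothetical, and this is not a drop-in replacement for the script in the post: it returns a dictionary of totals rather than printing as it goes.

```python
import os

def dir_sizes(top):
    """Return a dict mapping each directory under top (and top itself)
    to the total size in bytes of everything beneath it."""
    totals = {}
    # topdown=False: subdirectories are yielded before their parent.
    # followlinks is left at its default (False), so directories that are
    # symbolic links are not descended into.
    for dirpath, dirnames, filenames in os.walk(top, topdown=False):
        total = 0
        for name in filenames:
            try:
                total += os.path.getsize(os.path.join(dirpath, name))
            except OSError:
                # Broken symlink, or the file vanished after being listed.
                pass
        for name in dirnames:
            # Directories that were never walked (symlinked or unreadable)
            # have no entry and contribute 0.
            total += totals.get(os.path.join(dirpath, name), 0)
        totals[dirpath] = total
    return totals
```

Printing every entry of `dir_sizes(".")` that is over a minimum size, in sorted order, would give a directory listing close to the original script's, minus the per-file lines.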

And when I wrote the post, I hesitated a moment, thinking, "is this the best way to write this script, because I don't want to put it up there if it's lame in some way." And then I figured it was tangential to the real point, and useful even if boneheaded (it did solve my problem after all), so I posted it anyway. Damn the torpedoes!

Pedro 11:28 AM on 7 Sep 2008

Foone pointed out that symlinks can cause an exception to be thrown; in addition, they can cause an infinite loop in the directory recursion. Nowadays Vista (and, I suppose, Linux) lets you create directory symbolic links, thanks to the MKLINK command. The directory structure is no longer a tree, but a graph with cycles.

Michael Baltaks 11:15 PM on 9 Sep 2008

"For many of these I don't know whether it is safe to delete them."

That's a sad state of affairs. Anything in a folder marked as cache should be regenerated as needed in the normal operation of the software, so it should always be safe to delete. Is this not true? Is this one of those "my programmer's 6th sense says warning" situations, or have you been burned by deleting a cache folder before?
