Long-running restartable worker

Monday 13 June 2011

I had to write a program that would analyze a large amount of data. In fact, too much data to analyze all of it. So I resorted to random sampling of the data, but even so, it was going to take a long time. For various reasons, the simplistic program I started with would stop running, and I’d lose the progress I’d made crunching through the mountain of data.

You’d think I would have started with a restartable program so that I wouldn’t have to worry about interruptions, but I guess I’m not that smart, so I had to get there iteratively.

The result worked well, and for the next time I need a program that can pick up where it left off and make progress against an unreasonable goal, here’s the skeleton of what I ended up with:

import os, os.path, random, shutil, sys
import cPickle as pickle

class Work(object):
    """The state of the computation so far."""

    def __init__(self):
        self.items = []
        self.results = Something_To_Hold_Results()

    def initialize(self):
        """First-time setup: collect and shuffle the items to process."""
        self.items = Get_All_The_Possible_Items()
        # Shuffle so that popping items off the end is a random sample.
        random.shuffle(self.items)

    def do_work(self, nitems):
        """Do a unit of work: analyze `nitems` items."""
        for _ in xrange(nitems):
            item = self.items.pop()
            Analyze_The_Item_And_Accumulate(item, self.results)

def main(argv):
    pname = "work.pickle"
    bname = "work.pickle.bak"
    if os.path.exists(pname):
        # A pickle exists! Restore the Work from
        # it so we can make progress.
        with open(pname, 'rb') as pfile:
            work = pickle.load(pfile)
    else:
        # This must be the first time we've been run.
        # Start from the beginning.
        work = Work()
        work.initialize()

    while True:
        # Process 25 items, then checkpoint our progress.
        work.do_work(25)
        if os.path.exists(pname):
            # Move the old pickle so we can't lose it.
            shutil.move(pname, bname)
        with open(pname, 'wb') as pfile:
            pickle.dump(work, pfile, -1)

if __name__ == '__main__':
    main(sys.argv)
The names in Strange_Camel_Case are pseudo-code where the actual particulars would get filled in. The Work object is pickled every once in a while, and when the program starts, it reconstitutes the Work object from the pickle so that it can pick up where it left off.

The program will run forever, and display results every so often. I just let it keep running until it seemed like the random sampling had gotten me good convergence on the extrapolation to the truth. Another use of this skeleton might need a real end condition.
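To make the skeleton concrete, here is a tiny self-contained Python 3 version of it (pickle in place of cPickle, range in place of xrange) with a real end condition. The sum-of-squares workload and the `run` function are invented for illustration; the checkpoint dance is the same as above:

```python
import os
import pickle
import shutil

class Work:
    """Toy state: sum the squares of 1..100, restartably."""
    def __init__(self):
        self.items = list(range(1, 101))
        self.total = 0

    def do_work(self, nitems):
        for _ in range(min(nitems, len(self.items))):
            self.total += self.items.pop() ** 2

def run(pname):
    bname = pname + ".bak"
    if os.path.exists(pname):
        # A pickle exists: pick up where we left off.
        with open(pname, "rb") as pfile:
            work = pickle.load(pfile)
    else:
        # First run: start from the beginning.
        work = Work()

    while work.items:  # a real end condition
        work.do_work(25)
        if os.path.exists(pname):
            shutil.move(pname, bname)  # keep the previous checkpoint
        with open(pname, "wb") as pfile:
            pickle.dump(work, pfile, -1)
    return work.total
```

Kill it mid-run and start it again: it reloads the last checkpoint instead of starting over, and a run that finds nothing left to do just reports the finished total.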


For a bit of extra robustness, you may want to write the new pickle before you move the old one:
tmpname = pname + ".tmp"
with open(tmpname, "wb") as pfile:
    pickle.dump(work, pfile, -1)
if os.path.exists(pname):
    shutil.move(pname, bname)
shutil.move(tmpname, pname)

That way if something goes wrong with the serialisation, the last known good pickle will still be in place.
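In Python 3, a sketch of that write-then-swap step might look like the following; the `checkpoint` function name is mine, and `os.replace` is atomic on POSIX when source and destination are on the same filesystem:

```python
import os
import pickle

def checkpoint(work, pname):
    """Write a new pickle; the last good one stays in place
    until the new one is safely on disk."""
    tmpname = pname + ".tmp"
    bname = pname + ".bak"
    with open(tmpname, "wb") as pfile:
        pickle.dump(work, pfile, -1)  # if this fails, pname is untouched
    if os.path.exists(pname):
        os.replace(pname, bname)      # keep the previous checkpoint as a backup
    os.replace(tmpname, pname)        # install the new checkpoint
```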
You might also want to keep a copy of the pickle file before you replace it.

I've had situations like the above where running out of memory caused the pickle to get corrupted during the write...

Chaps, when poo happens, `cp work.pickle.bak work.pickle` is all you need. Anything more is over-engineering, surely?

A question: how would you extend this to allow parallel workers? Perhaps one pickle per worker, or one pickle representing the state of many workers?
@Nick: I see that your extra step would leave the pickle in the proper place. I was willing to rename the backup file by hand if something really bad happened.

@Chris: I do keep a copy of the pickle, or are you talking about something else?

@Bill: The simplest thing for parallel workers is for them to somehow segment the population of items. For example, each worker when started is given a number N: 0, 1, 2, 3. Then they only work on items whose id is N mod 4. Each worker writes their own pickle with their number in the file name, and a separate program knows how to read the pickles and combine them together. This isn't very sophisticated, but it is simple.
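That segmenting scheme can be sketched in a few lines; the helper names and the file-name pattern here are made up for illustration:

```python
def items_for_worker(all_ids, n, num_workers):
    """Worker n handles only the items whose id is n mod num_workers."""
    return [item_id for item_id in all_ids if item_id % num_workers == n]

def pickle_name(n):
    """Each worker checkpoints to its own file, so the workers
    never contend for one pickle."""
    return "work.%d.pickle" % n
```

The partitions are disjoint and together cover every item, so a separate combining program can read all the pickles and merge the results.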
Re Ned's response to Bill: the map/reduce/rereduce pattern seems to fit.
If the data are so huge that you can't process all of them, doesn't loading all the data into Work, and the frequent pickling, take too much time (and memory)?
@Yannick: I'm not loading all the data, I'm loading all the ids of the data. Each work item involves reading data from the db, reading associated files on disk, etc.
My god, just use a queue! Beanstalkd: bam, now you have a binlog and can have multiple workers on multiple computers. Or use GAE and essentially the same thing with defer. Tricky though! I like beanstalkd best. Also turtles.
