Keep data out of your variable names

Saturday 31 December 2011

I saw this question this morning:

I'm adding words to lists depending on what character they begin with. This seems a silly way to do it, though it works:

nouns = open('nouns.txt', 'r')
for word in nouns:
    word = word.rstrip()
    if word[0] == 'a':
        a.append(word)
    elif word[0] == 'b':
        b.append(word)
    elif word[0] == 'c':
        c.append(word)
    # etc...

Naturally, the answer here is to make a dictionary keyed by first letter:

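from collections import defaultdict

words = defaultdict(list)
for word in nouns:
    word = word.rstrip()    # drop the trailing newline, as in the question
    words[word[0]].append(word)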

The question reminded me of others I've seen on Stack Overflow or in the #python IRC channel:

  • How do I see if a variable exists?
  • How do I use a variable as the name of another variable?
  • How do I use a variable as part of a SQL table name?

The thing all these have in common is trying to bridge the gap between two domains: the data in your program, and the names of data in your program. Any time this happens, it's a clear sign that you need to move up a level in your data modeling. Instead of 26 lists, you need one dictionary. Instead of N tables, you should have one table, with one more column in it.
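Here's a minimal sketch of both moves, the dictionary and the extra column. The sample word list, the table name "nouns", and the column name "first_letter" are all made up for illustration, and the in-memory SQLite database is just a convenient stand-in for whatever database you actually use.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sample data standing in for the lines of nouns.txt.
nouns = ["apple", "avocado", "banana", "cherry"]

# One dictionary instead of 26 separately named lists.
words = defaultdict(list)
for word in nouns:
    words[word[0]].append(word)

# "Does the variable exist?" becomes an ordinary membership test.
# (Using "in" does not create a new key in a defaultdict.)
has_z_words = "z" in words

# One table with one more column, instead of one table per letter.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE nouns (first_letter TEXT, word TEXT)")
db.executemany(
    "INSERT INTO nouns (first_letter, word) VALUES (?, ?)",
    [(w[0], w) for w in nouns],
)

# The letter is now data passed as a query parameter, not part of a name.
a_words = [
    row[0]
    for row in db.execute(
        "SELECT word FROM nouns WHERE first_letter = ? ORDER BY word", ("a",)
    )
]
```

In both cases the letter has stopped being part of a name and become a value, which means it can be looped over, tested, and passed as a parameter like any other data.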

These situations all seem really obvious, but there are more subtle situations where this dynamic appears. I just wish I could think of an example! :)

Happy Hanukkah

Tuesday 20 December 2011

Tonight is the first night of Hanukkah. Of course, most calendars confusingly label tomorrow with "Hanukkah starts." I wish instead they would label today with "Hanukkah starts tonight."

I've chatted with plenty of friends over the years about the different natures of Christmas and Hanukkah, and finally hit on an apt metaphor: Christmas is like the Super Bowl, Hanukkah is like the World Series.

Like the Super Bowl, Christmas is one concentrated day. For celebrants, it is the only thing happening that day, and it's preceded by weeks of anticipation and preparation. Another similarity is the zeal with which businesses try to piggy-back on the excitement.

Hanukkah, on the other hand, is spread out over about a week, like the World Series. While the daylight passes as it normally would, the evenings are spent specially, and the entire week is tinged with a special feeling because of it. If you have to miss one night for another reason, that's OK, you've got others you can celebrate.

Each has its special feeling, either because of its intensity, or because of its calm. I enjoy them both.

If you celebrate Hanukkah, take pleasure in lighting your candles tonight!

Stop SOPA

Thursday 15 December 2011

The Stop Online Piracy Act is the latest battle in the war between the movie studios and the rest of us. It's a really bad bill that, depending on who you listen to, would either kill jobs in the US, or radically change the entire Internet, or both.

If you read this blog, you've probably already been pelted with calls to action. You may have figured you wouldn't do anything because it wouldn't have an effect. Do something anyway. It's the right thing to do.

When I was working for Hewlett-Packard, one of the advantages over being at a startup was that we could sign licensing deals with other really big companies, like Disney. I met once with a pair of HP guys who oversaw that sort of work. They used to work for the movie studios in a similar capacity. The day I met with them at our offices I happened to be wearing an EFF t-shirt, just because it was the top one on the pile that morning. When they saw me in it, they looked at me like I was carrying a sign that read, "I shoot babies."

They proceeded to tell me a story about a high-end TV console manufacturer that was building a $50k device that would let you feed in DVDs, and rip them to an internal hard-drive, then you'd have your entire collection of movies on a menu that you could access instantly. The point of the story was that the movie studios forced them to stop making this product, because although it was a closed system that would appeal to their best and richest paying customers, it involved ripping DVDs, and that was bad. So they were closing off new avenues of business because of the internal technology involved. So dumb.

One of them then launched into a rant about what would the world be like if the movie industry was put out of business by pirates? How would people be entertained? "The movie industry performs a valuable service, finding and producing the best entertainment!" Movies are great, it's true, but there's no constitutional right to the existence of a movie industry. It's existed for only about 100 years. Not only would people entertain themselves without it, just as they did before the movie industry existed, they may find better ways to do it.

Technology changes, and businesses sometimes suffer. The horse-buggy industry didn't try to write laws preventing automobiles. The vaudevillians didn't lobby their congressmen to stop movies. Radio didn't restrict freedom of speech in an effort to block TV.

I don't want the movie industry to go out of business, and I don't think it will. Their desperate actions to stop their perceived enemies are wrong-headed and technologically a bad idea.

SOPA is like burning down a house to get rid of the mice. How far are we willing to let these industries go in their desperate effort to stop piracy? Piracy's not a good thing, it's true, but the "solutions" in SOPA are worse.

Call your congressman.

Deleting files, keeping a few

Monday 12 December 2011

This is one of those conceptually easy tasks that seem to come up frequently, and yet need a complex incantation to accomplish. I have a series of files that will grow over time, and I want to clean them up but keep the most recent N files.

After poking around the Google, I found this for deleting PATTERN, but keeping the five most recent:

ls -t1 PATTERN | tail -n +6 | xargs -r rm -r

That's dash-t-one on the ls command. Or, in words:

  1. List files matching PATTERN, in descending order of modification time, in one column,
  2. Pass through all the trailing lines, starting with the sixth from the beginning,
  3. Bundle all those filenames into an "rm -r" command, but not if there are none.

That wasn't so hard, was it??
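For comparison, here is a rough Python sketch of the same three steps. The function name delete_all_but_newest and its keep parameter are my own inventions for illustration. One difference from the pipeline: because xargs splits its input on whitespace, the shell version can misbehave on file names containing spaces, and the -r flag is a GNU extension; the Python version passes whole paths around, so neither caveat applies.

```python
import glob
import os
import shutil


def delete_all_but_newest(pattern, keep=5):
    """Delete paths matching `pattern`, keeping the `keep` most recently modified.

    Returns the list of paths that were deleted.
    """
    # Step 1: list the matches, newest first (like `ls -t1 PATTERN`).
    paths = sorted(glob.glob(pattern), key=os.path.getmtime, reverse=True)

    # Step 2: skip the first `keep` entries (like `tail -n +6` when keep is 5).
    doomed = paths[keep:]

    # Step 3: remove the rest; an empty list simply does nothing.
    for path in doomed:
        if os.path.isdir(path) and not os.path.islink(path):
            shutil.rmtree(path)  # the `rm -r` case for directories
        else:
            os.remove(path)
    return doomed
```

Calling delete_all_but_newest("backups/*.tgz") would then keep the five newest matching files, where "backups/*.tgz" is only an example pattern.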

Duplicitous Django settings

Tuesday 6 December 2011

Django is easily the most popular Python web framework these days. For all of its features, and ease of use, though, sometimes it just seems misleading on purpose. This morning I fixed a mysterious problem, and once again I was reminded of how Django can seem simple until things go wrong, and then it's weirdly complex.

In particular, how the settings work is just odd. In two places, Django does something twice when it would be better to do it only once.

For Ibis Reader, our settings machinery is elaborate: the settings file imports from product_settings.py, then from a host-specific settings file, then from a local_settings.py which isn't committed to source control:

# settings.py

#.. lots of settings ..

from product_settings import *

# Settings particular to this host.
# For a host named xyz01.myapp.com, 
# create a file host_settings/xyz01_myapp_com.py
import platform
host_name = platform.node().replace('.', '_').replace('-', '_')
try:
    # The parenthesized form of exec runs on both Python 2 and Python 3.
    exec("from ibis.host_settings.%s import *" % host_name)
except ImportError:
    pass

# Last resort (good for dev machines): 
# import settings that aren't in the repo.
try:
    from local_settings import *
except ImportError:
    pass

This scheme works great: you can put settings in the file that corresponds logically to why the setting needs the value.

But something odd was happening: if a setting was in both product_settings.py and the host settings file, then the value in product_settings won. How could this be? The host settings file is applied after product_settings!

Part of the answer is the first thing that Django does twice that should only happen once: the settings file is imported twice. This flies in the face of everything we know about Python modules, but it happens. So the actual order of imports for my settings files is:

  1. from product_settings import *
  2. from ibis.host_settings.my_host import *
  3. from local_settings import *
  4. from product_settings import *
  5. from ibis.host_settings.my_host import *
  6. from local_settings import *

I don't know why Django imports twice, but it's long been true, and I've had to rediscover it the hard way a few times.

But this still doesn't explain the mystery: every time product_settings is applied, host settings should then be applied over it, so why would a setting in product_settings take effect over one in host settings? The answer is in the second thing that Django does twice: adding directories to the Python path.

I don't know if this is really Django's fault, or something about the way people seem to always configure their Django projects, but it seems to very often be true: your source files are available through two different import paths, because your source tree has been added to the Python path twice at two different levels.

A Django project has a top level corresponding to the project ("ibis" in this case), and then apps beneath that. The Python path is constructed so that you can import a file as "my_project.my_app", or just as "my_app". Except that for some reason, this double-view of the source tree isn't always available, and it isn't during that second series of settings imports!? The path is being modified between the two import sequences!
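The double view of the source tree is easy to reproduce without Django at all. This sketch builds a throwaway project in a temp directory (the names demo_project and demo_settings are invented for the demonstration), puts it on the path at two levels, and imports the same file under both names.

```python
import importlib
import os
import sys
import tempfile

# Build a throwaway tree: <root>/demo_project/demo_settings.py
root = tempfile.mkdtemp()
project_dir = os.path.join(root, "demo_project")
os.mkdir(project_dir)
open(os.path.join(project_dir, "__init__.py"), "w").close()
with open(os.path.join(project_dir, "demo_settings.py"), "w") as f:
    f.write("print('executing %s' % __name__)\nVALUE = 1\n")

# Put the source tree on the Python path at two different levels.
sys.path[:0] = [root, project_dir]
importlib.invalidate_caches()

long_form = importlib.import_module("demo_project.demo_settings")
short_form = importlib.import_module("demo_settings")

# One file on disk, but two distinct module objects, each executed once.
same_file = os.path.samefile(long_form.__file__, short_form.__file__)
same_module = long_form is short_form
```

The file's print line runs twice, once per name, and same_file is True while same_module is False. Python caches modules by name, not by file, so a file reachable under two names is executed once for each. I can't say for sure this is the mechanism behind Django's double import, but it shows that a doubled path makes it possible, and that removing one path entry makes the long form fail with ImportError.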

So the import march actually looks like this:

  1. from product_settings import *
  2. from ibis.host_settings.my_host import *
  3. from local_settings import *
  4. from product_settings import *
  5. from ibis.host_settings.my_host import *: Import failed!
  6. from local_settings import *

The net result is that settings in both product_settings and host settings will keep the value from product_settings, even though host settings is imported second.

The fix is really easy: remove "ibis." from the host settings import line, taking advantage of the fact that both forms normally work. In fact, the shorter form is more robust, since it seems to always be available on the Python path. The settings files still get imported twice, but at least the same thing happens both times.
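If the exec-built import bothers you, the same layering can be sketched with importlib. This is not what our settings file does; apply_overrides is a hypothetical helper, and the module names in the demo are invented. Note one difference from "import *": this copies only UPPERCASE names, which are the only ones Django treats as settings anyway.

```python
import importlib
import sys
import types


def apply_overrides(module_name, namespace):
    """Copy UPPERCASE names from `module_name` into `namespace`.

    Returns True if the module was imported, False if it was not found.
    Like the try/except in the settings file, this also swallows an
    ImportError raised from inside the module being imported.
    """
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return False
    for name in dir(module):
        if name.isupper():
            namespace[name] = getattr(module, name)
    return True


# Demonstration with a fake host settings module registered by hand.
fake = types.ModuleType("demo_host_settings")
fake.DEBUG = False
fake.helper = "not a setting"
sys.modules["demo_host_settings"] = fake

settings = {"DEBUG": True, "TIME_ZONE": "UTC"}
found = apply_overrides("demo_host_settings", settings)
missing = apply_overrides("no_such_host_settings_module", settings)
```

In a real settings file the call would be something like apply_overrides("host_settings.%s" % host_name, globals()). The return value is the useful part: logging it would have shown right away that the host settings import was failing on the second pass.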

I still don't understand why all these things happen. I hope part of this is my fault, because then I can fix it for real.

Maintenance hatches

Saturday 3 December 2011

Here's a pattern I've repeated too many times to count: build some software, deploy the software, software doesn't work, wish I had more data points to figure out why. This is typical, and is often remedied after the fact by adding more logging. Everyone knows: you want tons of logging.

There's another tool that I don't use enough until I need it: maintenance hatches. On a physical machine, you need to be able to get at the inner workings of the thing to observe it, fiddle with it, and so on. The same is true for your software.

On a web site, these hatches take the form of URLs intended only for the developers and maintainers of the site to use. As a simple example, do you have a way to test the error handling on your live production server? If you want to see what happens when an exception occurs, you need a way to raise an exception on your live site. But you've tried hard to make sure that never happens in your code. Here's a view that will do it for you:

from django.contrib.admin.views.decorators import staff_member_required

@staff_member_required
def raise_error(request):
    """Raise an exception.  How else will we know our stack traces work?"""
    msg = request.GET.get("msg", "Something bad happened! (on purpose)")
    raise Exception(msg)

When you visit this URL, it raises an exception, simple. The message defaults to something that indicates it was intentional, but for convenience, you can provide your own message as a parameter on the URL. I've made it available only to staff members so that it can't become a nuisance doorbell, and so that search engines won't accidentally trigger it.

Once this view is in place, you'll have a maintenance hatch that lets you look directly at a small part of your complex machinery. There are lots of other diagnostic tools that are possible:

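import logging

from django.conf import settings
from django.core.mail import send_mail
from django.http import HttpResponse

log = logging.getLogger(__name__)   # assumed: "log" is a standard logging logger

@staff_member_required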
def send_email(request):
    """Send an email to test the mail-sending infrastructure."""
    msg = request.GET.get("msg", "Test of sending email")
    send_mail(
        msg, 
        'The body also says "%s"' % msg, 
        settings.DEFAULT_FROM_EMAIL, 
        [request.user.email], 
        fail_silently=False
        )
    return HttpResponse("An email was sent to %s" % request.user.email)

@staff_member_required
def dump_settings(request):
    """Dump all the settings to the log file."""
    log.info("----- Django settings:")
    for a in dir(settings):
        if a.startswith('__'):
            continue
        # Separate the name from the value so the log line is readable.
        log.info("%s = %r" % (a, getattr(settings, a)))
    return HttpResponse("Settings have been logged.")

As you get deeper into your product-specific code, you'll get away from simple general views like this into things that will only be useful to you, which is why you have to build them yourself rather than finding an off-the-shelf application.

These examples are for Django, but the principle is the same for any software; it isn't limited to web applications.

One more I've found very useful: spawn a Celery task, to figure out if that machinery is properly configured:

@staff_member_required
def task_ping(request):
    """Send a simple task to a worker queue."""
    msg = request.GET.get("msg", "Task ping!")
    pingtask.delay(msg)
    return HttpResponse("Sent a task with message, '%s'" % msg)

@task
def pingtask(msg):
    print("Ping: %s" % msg)   # parenthesized form works on Python 2 and 3

Often these views are written as a reaction to a specific problem, and then are forgotten, but they can be useful tools in the trenches. Write them for keeps, and document them so your staff knows they're at their disposal, and they'll be useful to you in the future.
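Since the principle isn't limited to web applications, here is a rough framework-free sketch of the same idea. Everything in it (the HATCHES registry, the hatch decorator, open_hatch) is invented for illustration. How a hatch gets opened, whether by a command-line flag, an admin console, or a signal handler, is left to your application, and unlike the Django views above it has no access control of its own.

```python
import logging

log = logging.getLogger("hatches")

# Registry of maintenance hatches, keyed by function name.
HATCHES = {}


def hatch(func):
    """Register `func` as a maintenance hatch."""
    HATCHES[func.__name__] = func
    return func


@hatch
def raise_error(msg="Something bad happened! (on purpose)"):
    """Raise an exception to exercise the error reporting."""
    raise Exception(msg)


@hatch
def dump_config(config):
    """Log every entry in a configuration mapping; return how many there were."""
    log.info("----- Configuration:")
    for name in sorted(config):
        log.info("%s = %r", name, config[name])
    return len(config)


def open_hatch(name, *args, **kwargs):
    """Run the named hatch. An unknown name raises KeyError."""
    return HATCHES[name](*args, **kwargs)
```

Keeping the hatches in one registry also makes them easy to list, which helps with the documentation problem: sorted(HATCHES) tells your staff what tools they have.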
