Keep data out of your variable names

Saturday 31 December 2011

I saw this question this morning:

I’m adding words to lists depending on what character they begin with. This seems a silly way to do it, though it works:

nouns = open('nouns.txt', 'r')
for word in nouns:
    word = word.rstrip()
    if word[0] == 'a':
        a.append(word)
    elif word[0] == 'b':
        b.append(word)
    elif word[0] == 'c':
        c.append(word)
    # etc...

Naturally, the answer here is to make a dictionary keyed by first letter:

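from collections import defaultdict

words = defaultdict(list)
for word in nouns:
    word = word.rstrip()    # drop the trailing newline, as in the question
    words[word[0]].append(word)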
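
As a minimal self-contained sketch of how the result is used: the list named lines below is a stand-in for the contents of nouns.txt (assumed here to hold one word per line), so the snippet runs without the file.

```python
from collections import defaultdict

# Stand-in for the lines of nouns.txt (assumption: one word per line).
lines = ["apple\n", "bear\n", "avocado\n", "cat\n"]

words = defaultdict(list)
for line in lines:
    word = line.rstrip()
    words[word[0]].append(word)

# The first letter is now data (a key), not part of a variable name.
a_words = words["a"]       # every word starting with 'a'
letters = sorted(words)    # the first letters that actually occurred
```

Asking for a letter that never occurred, such as words["z"], gives an empty list rather than a NameError.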

The question reminded me of others I’ve seen on Stack Overflow or in the #python IRC channel:

  • How do I see if a variable exists?
  • How do I use a variable as the name of another variable?
  • How do I use a variable as part of a SQL table name?

The thing all these have in common is trying to bridge the gap between two domains: the data in your program, and the names of data in your program. Any time this happens, it’s a clear sign that you need to move up a level in your data modeling. Instead of 26 lists, you need one dictionary. Instead of N tables, you should have one table, with one more column in it.
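
The same move for the SQL case might look like the sketch below, using Python's sqlite3 module with an in-memory database. The table name scores, its columns, and the sample rows are all invented for illustration. Instead of one table per game with the game's name spliced into the SQL, there is one table with a game column, and the game is passed as an ordinary query parameter.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One table with an extra "game" column, instead of chess_scores, go_scores, ...
conn.execute("CREATE TABLE scores (game TEXT, player TEXT, points INTEGER)")
conn.executemany(
    "INSERT INTO scores VALUES (?, ?, ?)",
    [("chess", "ann", 3), ("chess", "bob", 1), ("go", "ann", 2)],
)

def scores_for(game):
    """Return (player, points) rows for one game, ordered by player."""
    cur = conn.execute(
        "SELECT player, points FROM scores WHERE game = ? ORDER BY player",
        (game,),
    )
    return cur.fetchall()

chess_scores = scores_for("chess")
```

Because the game is data, a new game needs no schema change, and a game with no rows simply returns an empty list.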

These situations all seem really obvious, but there are more subtle situations where this dynamic appears. I just wish I could think of an example! :)

Comments

[gravatar]

locals() is your friend: 'var_name' in locals(), locals()['prefix_' + var_name].

The third question indicates an overly complicated design, unless you are writing another ORM.

[gravatar]

@void: Ugh, no! The whole point is to use a dict instead of doing variable name hacks like this with locals(). This is the kind of literal answer that does no good for beginners to hear. You need to find out what they are really trying to accomplish, and give them the appropriate tools, not just answer their narrow, misguided question.
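
For comparison, here is a rough sketch of the dict version of those two locals() tricks. The names settings, prefix_color, and prefix_size are made up for the example.

```python
# Hypothetical values that might otherwise live in variables named prefix_*.
settings = {"prefix_color": "red"}

name = "color"
key = "prefix_" + name

has_color = key in settings                   # "does the variable exist?"
color = settings[key]                         # "use a variable as a variable name"
size = settings.get("prefix_size", "unset")   # a missing key gets a default
```

The membership test and the lookup are plain dictionary operations, so nothing depends on which names happen to be in scope.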

[gravatar]

Another way to do this:
nouns = open('nouns.txt', 'r')
for word in nouns:
    exec('{0}.append(word)'.format(word[0]))

[gravatar]

About once per decade you come across a legitimate reason for testing for the existence of a variable. E.g., the following JavaScript I wrote the other day:

// Usable as a Node.js module or directly if 'exports' isn't defined.

var rfc3339;

try {
    // Assign if being used in Node.js or other CommonJS environment.
    rfc3339 = exports;
} catch(e) {
    // Not in a CommonJS environment, assume just in a browser or something.
    rfc3339 = {};
}

....

rfc3339.parse = parse;
rfc3339.format = format;

[gravatar]

@dudi: Again, this is making me sad. Why create code like this? The point isn't to find weird obscure ways to get the job done. Would you put code like this into your own project? That isn't engineering, it's stupid pet tricks.

[gravatar]
Alex Kesling 7:56 PM on 2 Jan 2012

Let's take a step back for a second and just discuss some basic variable handling and protection which we must handle ourselves in python. In many real-world cases our file won't be perfectly formatted.

The first thing to think about in this case is that there may be leading whitespace on the line... this means we must use "word.strip" in place of "word.rstrip". That's pretty straightforward, now we can handle " foo" as well as "bar\n".

Next we have something that may not be quite so obvious at first glance. What if we have two newlines in a row _somewhere_ in the file, or a line composed purely of whitespace characters? In these cases the use of "word[0]" is actually problematic...

Unless we take extra steps to guard against this state, "word.rstrip()" will have length 0, and indexing it will throw an exception (an IndexError). In the first example the Pythonic solution would be to use ".startswith", as in "word.startswith('a')" (as an added bonus, ".startswith" can check for prefixes of more than one character, as in "word.startswith('foo')", but that's tangential). This returns True when the first character is 'a' and False in all other cases, including the empty string. This only handles the initially proposed code, not the cleaner version... so....

In the case of the dictionary approach, the use of indexing into a potentially 0 length string is still dangerous, but since we are using it directly as an index we can't use our new-found ".startswith" method. This sadly means that we have to fall back to a guard or do some gymnastics (and there are gymnastics we can do... but aren't we trying to avoid those?). Either we surround the block with a try/except or we add a conditional checking for some minimum length.
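
A small sketch of the behaviors described above, plus one possible guard. The helper name first_letter is hypothetical, and returning None for blank lines is just one choice among several.

```python
empty = "   \n".strip()    # a whitespace-only line strips down to ""

safe_check = empty.startswith("a")    # False, and no exception
sliced = empty[:1]                    # slicing an empty string gives ""

try:
    empty[0]
    raised = False
except IndexError:
    # Indexing an empty string is the case that blows up.
    raised = True

def first_letter(line):
    """Return the first non-whitespace character of line, or None if it is blank."""
    word = line.strip()
    return word[0] if word else None
```

With a guard like this, the dictionary loop can skip any line for which first_letter returns None.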

[gravatar]

@Alex, wow, we weren't talking about those issues! I have no idea what the original person's data file looked like, so I don't know whether this much care is needed.

[gravatar]

Reminds me of situations where inheritance is used instead of aggregation and/or a strategy object. It leads to class hierarchies with leaf names like DirectDepositLifePolicy, DirectDepositGeneralPolicy, MonthToMonthLifePolicy, MonthToMonthGeneralPolicy etc, coupled with nested-if statements to create the correct type of object :(

[gravatar]

Without knowing anything at all about this, really, here's my question: when you rely on a dictionary, aren't you relying on the quality of the dictionary? What if new words appear, or people are inputting names? It seems to me to be more robust to just look at the first letter, even if that isn't so elegant. The example you gave would work for any word written in the Latin alphabet, which is pretty robust. On the other hand, the dictionary approach weeds out nonsense and typos. It seems that the dictionary would get stumped a lot more often, which might be good or bad. Oh well, back to lurking.

[gravatar]

@Paul, we have a terminology conflict here: In the real world, "dictionary" means a book containing words compiled by a lexicographer, and so is by definition a pre-determined set of words. In Python (and some other computing arenas) a dictionary is a data structure for storing a value under a key so that later if you have the key you can find the value. Any value can be stored with any key. In this case, the values are lists of words, and the keys are the first letter of the word.

The names are the same because both real-world dictionaries and Python's dictionaries are ways of finding something if you have the key. In the real-world dictionary, the key is the word, and the value is the definition. Ironically, though, they do this in very different ways. In the real-world dictionary, it is essential that the values be recorded in order, sorted by their key. In computing dictionaries, it is much faster if they are stored in an apparently random fashion, so that the dictionary has no sensible ordering at all!

[gravatar]

I'm coming across this very late, but let me quickly note how incredibly true and important the post is (and how frustrating it is that so many comments miss the point!) Like you, I run into this kind of question regularly on Stack Overflow, and am glad to have somewhere to refer askers.

[gravatar]

@Alex did that really require 5 paragraphs?

nouns = (word.strip() for word in open('nouns.txt', 'r') if word.strip())
words = defaultdict(list)
for word in nouns:
    words[word[0]].append(word)

[gravatar]

Wow.
I read this article and thought
"well yeah, that's obvious"

But after reading the comments, I would like to cry. You basically said
"water is wet"
and people argued with you.

[gravatar]

I just realized I forgot to let you know I was linking to this post.

Basically, I wrote a much more long-winded post about the same thing last year, someone pointed out that you'd written a much simpler version of the same thing, so I added a link to yours from the top of mine. (You can see it at http://stupidpythonideas.blogspot.com/2013/05/why-you-dont-want-to-dynamically-create.html).

[gravatar]

And while I'm here, one comment on your last comment (which I'm sure you already know, but people like Paul Downs may not have thought of): Sometimes storing dictionaries in sorted order (which can be binary-searched) is more helpful than in arbitrary order. The down side is that it's a little slower (logarithmic instead of constant) and the underlying data structure can be more complicated, but the tradeoff is good enough that many languages' standard libraries provide both kinds of dictionaries (e.g., unordered_map and map in C++). Python's doesn't, but people have suggested changing that many times. (I have another blog post about that, http://stupidpythonideas.blogspot.com/2013/07/sorted-collections-in-stdlib.html, if anyone's interested.)
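
To make the sorted-order tradeoff concrete, here is a rough Python sketch using the standard bisect module on a sorted list. The word list and the function name words_with_prefix are invented for the example. A binary search finds where a prefix would start, which a hash-based dict cannot do without scanning every key.

```python
import bisect

# Keys kept in sorted order, as in a real-world dictionary.
sorted_words = sorted(["cherry", "bar", "apple", "banana", "avocado"])

def words_with_prefix(words, prefix):
    """Return every word in the sorted list that starts with prefix."""
    # Binary search (logarithmic) for the first word >= prefix.
    start = bisect.bisect_left(words, prefix)
    matches = []
    # All words sharing the prefix sit next to each other in sorted order.
    for word in words[start:]:
        if not word.startswith(prefix):
            break
        matches.append(word)
    return matches

ba_words = words_with_prefix(sorted_words, "ba")
```

Keeping the list sorted while inserting is the slower part, which is the cost side of the tradeoff.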

[gravatar]

I would like to use one example, tho very obscure, where I found the exec() command very useful. It had to do with writing code that was more easily understandable (Weird, I know, right?).

I'm just completing my master's in structural engineering, and one of my homework assignments had some very long and nasty equations that I had to program to run a Monte Carlo simulation. (Yes, statistics; now you know what I mean by 'long and nasty equations'.)

At first it looked something like
p=1, fy=2,alpha=3,...
variable[P]*variable[fy]^variable[alpha]+variable[...]...

and I realized this was stupid. I used exec(P=value), etc. and ended up with code more like

P*fy*(1-alpha*b*c/sigma....

This was very useful because I had to code in a number of such equations with varying levels of complexity, and with this method I could literally copy-paste my original equations directly into my code in an immediately understandable format.

Ok, I went through and found my code that used this:

if a < .414*l
    g_Defl = l / 240 - P * a * (l^2 - a^2)^3 / (3 * E * I * (3 * l^2 - a^2)^2);
else
    g_Defl = l / 240 - P * a * (l - a)^2 / (6 * E * I) * sqrt(a / (2 * l + a));
end

if g_Defl < 0
    counterDefl = counterDefl + 1;
end

% check moment
Mload = P * (l - a)^2 * (a + 2 * l) / 2 / l^3 * a;
Mfixed = P * a * (l - a) * (a + l) / 2 / l^2;
g_M = Fy * Z - max(Mload, Mfixed);
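
For comparison, the same readability seems achievable without exec. The sketch below is a Python port of the g_Defl expression above (which looks like MATLAB-style syntax, so the port itself is an assumption). The function name deflection_margin and the sample values are hypothetical. The sampled values stay in a dict and are unpacked into named parameters with **, so the equation body still reads like the original.

```python
import math

def deflection_margin(P, a, l, E, I):
    """Port of the g_Defl expression; the parameter names match the equation."""
    if a < 0.414 * l:
        return l / 240 - P * a * (l**2 - a**2)**3 / (3 * E * I * (3 * l**2 - a**2)**2)
    return l / 240 - P * a * (l - a)**2 / (6 * E * I) * math.sqrt(a / (2 * l + a))

# Hypothetical sample values, chosen only to exercise the function.
sample = {"P": 1.0, "a": 1.0, "l": 4.0, "E": 1.0, "I": 1.0}

# The dict keys become the short names used inside the equation.
margin = deflection_margin(**sample)
failed = margin < 0
```

Each Monte Carlo draw would then be one dict, and the names stay local to the function instead of being created at run time.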

[gravatar]

Thanks Ned. I'm a newbie and was trying to break out a list of lists into a bunch of separate lists, assigned to variables. While I don't understand most of what you're saying in your post, I did understand that I need to "move up a level in my data modeling". Now, I don't really know how to do that and your post wasn't enough to go on, but at least I know I have to! I've only been programming python for about 6 hrs now, so I guess I'll go look up how to make a dictionary from a list of lists (or tuples?). Thanks for pointing me towards elegance!

[gravatar]

@Ned Batchelder "Would you put code like this into your own project? That isn't engineering, it's stupid pet tricks."

Do you really perceive programming as engineering?

[gravatar]

@Furjoza: yes, I do. Programmers solve problems by understanding the possibilities of a technology, and choosing among them, making tradeoffs along the way.

[gravatar]

Hi Ned,

Thanks for writing on this topic.

I came across a problem on JetBrains Academy that supplied variables like this in the template of the problem:

bloomberg_com = "something"
nytimes_com = "something else"

The problem then asked the student to print the values of those variables when the user inputs "bloomberg.com" or "nytimes.com".

You had a lot of students using eval() and values().keys() in order to access those variables based on the input, rather than using conditionals as other students did. Changing the problem template was obviously discouraged, even though it was not explicitly prevented, since no one changed the structure to just use a dictionary as you would probably suggest.

So it seems that educators are sometimes inviting the sorts of hacks you are denouncing here by supplying study problems that are either too inflexible or which lack proper direction. Poor habits forming early?
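
For what it's worth, a minimal sketch of the dictionary version of that exercise might look like this. The name pages and the lookup helper are made up, and the two strings are the placeholders from the template above.

```python
# The site names are data, so they become keys instead of variable names.
pages = {
    "bloomberg.com": "something",
    "nytimes.com": "something else",
}

def lookup(url):
    """Return the stored text for url, or None if the site is unknown."""
    return pages.get(url)

result = lookup("nytimes.com")
```

The user's input is used directly as the key, so no eval(), locals(), or chain of conditionals is needed.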

[gravatar]

Quick correction: I meant locals().keys(), not values().keys()!
