|Ned Batchelder : Blog | Code | Text | Site|
I was walking through Copp's Hill burying ground in Boston's North End today. The gravestones are fascinating, partly for the concrete connection to a distant past, but also for their antiquated style.
For example, many of the stones had ye in place of "the." That y isn't actually a y, it's a thorn, a letter that fell out of use in English a few hundred years ago. Thorn is pronounced th, so ye isn't "yee", it actually is "the."
After noticing how older stones had ye and newer ones had "the", we came across Mary Ela's stone:
The odd thing here is a single gravestone that has both THE and Ye. Why use both forms? Why in one place but not the other? Perhaps because the English sentence flowed more naturally, but the dates and age felt like a conventional form?
The other transitional note is the date: 1737/8. The 7-or-8 was not because her date of death was unknown. It's because the world didn't agree on how to number years.
Our current calendar is the Gregorian calendar, a work of some engineering and design. In the English-speaking world before 1752, the legal year began on March 25th, Lady Day. It wasn't until the calendar reform of 1752 that everyone agreed the number of the year should be incremented on January 1st. In 1737, some people used year numbers incremented on January 1st, and some used years incremented in March. Since Mary Ela died in early March, some would have called the year 1737, but some would have called it 1738. Putting both numbers on the gravestone makes it clear when she died.
By the way: the year starting in March explains why September, October, November, and December are named as the 7th, 8th, 9th, and 10th months: if you count March as the first month, then the numbering makes perfect sense.
It's fascinating to think back to those times. We deal now with confusing timezones and character sets, but at least the English alphabet and the number of the year are simple, right? Well, back in 1737, you couldn't even count on that.
Coverage.py reads a configuration file, which by default is .coveragerc, with a leading dot. For years I thought Pylint had no default config file, because it wouldn't find .pylintrc; it turns out it looks by default for pylintrc, with no leading dot.
Which is correct?
I guess I was modeling .coveragerc on .bashrc, .vimrc, and all the other files that clutter home directories everywhere. But is that right? I asked on Twitter a few months ago:
A few people said they should have dots if they are in your home directory, which is clearly true. But these config files are not meant for the home directory.
Hmm, an interesting point. So is .coveragerc essential to coverage.py? It's only for overriding defaults, so it isn't required. But it does specify how coverage.py should behave.
Should it be coveragerc instead? Or coverage.rc? Opinions? Of course, .coveragerc files will still be recognized if a new default is used. I know this is a small point, but I'd like to follow the consensus if there is one.
The fallout from PyCon this year has been dramatic, involving Adria Richards, Alex Reid, SendGrid, PlayHaven, and the PyCon organizers. I wasn't involved in this event at all, so I have no first-hand knowledge of it, but it saddens me greatly. So many things have happened that I wish had not happened.
Improving community is difficult. Getting 2500 people together without friction is impossible. Friction and offense will happen; the question is, what do we do about it? It seems to me there are two mindsets about how to improve a community.
The first mindset is, "Let's get rid of the assholes, and the people that are left will be a great community." I'll call this the shunning model: identify the Bad People, get rid of them, and you will have only Good People left.
The second mindset is, "We're all different, and we're going to make mistakes, so let's be thoughtful and educate each other." I'll call this the educating model: people are imperfect, but basically good, and if we can keep an eye on things and keep communicating, we can all improve.
When I look back at the aftermath of PyCon, I see a number of events that fit into the shunning model, and few that fit into the educating model. This to me seems to be the heart of the problem.
We often talk about building an inclusive community, which usually means that women should be as welcome as men. I want it to also mean that people who make mistakes can be kept as members. Clearly, some people will be difficult enough that they won't be welcome, but most people who offend are good people who've made a mistake, not incorrigible assholes. I don't want a One Strike And You're Out community.
Let me tell you about my experiences at PyCon. I had at least three incidents of "community friction" during my time there:
In incident #1, I was the offender, and I'm really glad I was educated instead of shunned.
In incident #2, I was the offended. The member in question has been banned. I wasn't part of deciding the sanctions, but am glad to see in his blog post that he is thoughtful about what happened.
In incident #3, I was the offended, but did nothing. If I had known the speaker better, I might have said, "I wish you wouldn't use that word that way," but it didn't seem right at the time.
Friction is inevitable. One of the great things about PyCon is that it is right at the boundary between being comfortable with old friends, and meeting new people. There are bound to be incidents. We have to accept that, and try hard to talk to each other to improve things for everyone.
Education is better than shunning.
Here at PyCon, attendees get the usual ribbons dangling from their badges: Speaker, Sponsor, etc. They're always a topic of conversation, especially if you have three or more, or are wearing any of the joke ones like "Workaholic" or "King".
But most attendees have no ribbons, and it occurred to me we could have tokens for everyone. I thought of them as merit badges, but the more current terminology would be achievements.
And so on. Could be fun.
I gave a talk yesterday at PyCon 2013 called Loop Like a Native. The main point was: write more generators. Give it a look.
I have long been interested in the technology of printing, especially typography. Two years ago, when I went to the Boston Printing Office auction, I met Keith Cross, a lecturer at MassArt. He teaches letterpress classes, usually spread over the course of a semester, which I wasn't able to commit to. When he announced a one-day Saturday workshop, I jumped at it. It happened yesterday, and I loved it.
The workshop is hands-on: you start off standing at a type case, and with a few minutes of verbal instruction, you start putting type into a job stick:
I was familiar with the concepts of cold metal type, job sticks, California cases, and so on, but had never had a chance to try it myself. I was really pleased to be able to work with these little pieces of metal, and set actual lines of type.
You don't need to know the history of type to take the class; Keith explains what you need to know, and walks around gently guiding people to get them over rough spots.
I was having fun, and so went a little overboard. I wanted to use the fancy ligatures, so I included words like, "fluffy waffles," and when I saw that my type case had a "gg" ligature, I had to add "eggs." Then I discovered it had "zy" and "gy" ligatures, so I couldn't stop until I got "syzygy" in there!
Each student produced a few lines of type, then we all headed over to the press to assemble them together into a page:
Once we had all contributed our lines, Keith added larger blocks of iron and wood, known as "furniture," around it to hold it all in place, and locked it in with quoins.
A little discussion of paper, the mechanics of the press, then Keith rolls the press, and there's a printed sheet!
The first sheet is called a proof, because it's used to check the typesetting. The next step is to read the proof, an activity known as "proofreading"! One of the things I find fascinating about type is how history bleeds through into the present in the terminology and conventions. "Leading" makes much more visceral sense when you are holding a 6-pt thick bar of lead alloy, and place it between your rows of type.
Mistakes were identified, as were worn pieces of type. The form in the press was unlocked, and corrections made by pulling out bits of type with tweezers, inserting new ones, adjusting spacing, and so on. Then we each printed a sheet of our own.
When the printing was done, and the type had been cleaned, we each retrieved our lines of type, and put each tiny piece back into the proper compartment. Thinking about how long it took to find the type, set the type, adjust the type, print the type, and then clean up and put away the type, I am amazed anew that books, newspapers, and even encyclopedias ever got printed. This was very labor-intensive, dirty work. A few hours in a letterpress shop, and it is clearly industry.
After lunch, we each worked on a project of our own. I chose a quote from E.B. White:
I liked it not only for what it says, but because the words "great" and "small" could be used for a little type expression.
Here's where the old ways really seemed difficult! I wasn't sure what layout to use, and ended up changing my mind half-way through. So I had to move rows of metal chunks from one line to the next, and hope not to drop the whole thing.
Each line has to be completely packed tight with metal so that when the type is on the press, it will all be held tightly in place. This means that you need to find just the right thickness to fill the gaps, perhaps even slivers of brass to finish a tiny space.
I wanted an em-dash, but there wasn't one in my type case. We made one with a 36-pt piece of metal, spaced properly with yet more tiny chunks of metal. After dealing digitally with different kinds of spaces and dashes and leading, it was a new perspective to have to actually build it from pieces of metal.
To add to the complexity, the quote is set in 36-pt Centaur, but the word "small" is in 30-pt. I had to make up the 6-pt difference with strips of metal just as long as the word "small", but some above and some below to get the baseline right. I'd like to have used 24-pt for greater differentiation, but the next available size below 30 was 18. Another analog limitation.
When it was my turn for the press, Keith set up the form and locked it in, and I printed off a few dozen on gold paper:
Even before I printed it, there were things I wished I could improve about the layout, but time (and ability!) were short. The stars are there kind of as a distraction, but also because it's fun to use ornaments. But they are too low for the line: you can see in the photo, they are cast on a 36-pt body like the rest of the type, but they have a much lower baseline. I'm not sure how I would have adjusted for that even if I had had time.
But I'm still very pleased with the result:
Keith made the day fun; he's friendly, passionate, knowledgeable, and helpful. I highly recommend his class, it's a great way to see what letterpress is all about. It gives you a whole new perspective on type, and the technology that brought us to our current tools and techniques.
This week I began a new job at edX. Well, it isn't really a new job, I've been freelancing with them since October, but now I'm an employee.
EdX is a non-profit formed by Harvard and MIT to put their courses online. Other top universities have joined, including Berkeley, Rice, McGill, and the Delft University of Technology.
Online learning is huge these days, despite being saddled with the worst acronym ever (MOOC, Massive Open Online Course), so there's no shortage of interesting questions:
There are plenty of smart people at edX working on these questions, and I think we have a good chance at finding the right answers. There's stiff competition from Coursera and Udacity, but edX is a bit different, both because it is a non-profit, and because we are chartered by our universities to help them change on-campus education as well as online education.
Almost everything is written in Python and Django, and we're aiming to open-source what we can.
I'm excited. I enjoyed freelancing for a few years, now I know how that works, and I might go back to it someday. But it feels good to be in the middle of edX, helping build something great.
My mom's WordPress site has some malware on it, and she sent me the suspicious file for a professional opinion. The mystery file was called wp-rss3.php. Looking at it showed that there was source code encoded in it, so understanding what it did would require decoding the data. I fired up a Python prompt, and started picking away.
Read the file, and take a quick look to see what structure it has:
>>> wprss3 = open('wp-rss3.php').read()
The file is one long line, so let's split it into lines:
>>> wprss3 = wprss3.replace(' ', '\n').replace(';',';\n').splitlines()
OK, six lines, one of which has the bulk of the data. Let's look at them:
Line 0 is uninteresting, but line 1 defines a string using hex escapes. Many of our steps here will need to pull the raw data out of a string that makes up the bulk of a line. Splitting on double-quotes will get us pieces, one of which is the one we want. Rather than counting pieces to find the right one, we know the one we want will be the longest piece, so we can use max() to find it:
>>> d = max(wprss3[1].split('"'), key=len)
One of Python's handy-dandy decoders is 'string_escape' which can turn a string with backslash-x sequences into the correct string:
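On the Python 2 prompt that step was d.decode('string_escape'); as a self-contained sketch, with the hex literal retyped here from the result, and using the 'unicode_escape' codec, Python 3's closest spelling:

```python
# Python 2 spelled this d.decode('string_escape'); Python 3 spells it like this.
hexed = r"\x63\x72\x65\x61\x74\x65\x5f\x66\x75\x6e\x63\x74\x69\x6f\x6e"
name = hexed.encode("ascii").decode("unicode_escape")
# name is now "create_function"
```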
OK, so $_8b7b is "create_function", a PHP function. Let's see what line 2 gives us:
Interesting, now for the bulk of the data, line 3:
Mentally using our definitions of $_8b7b and $_8b7b1f, this is equivalent to:
$_8b7b1f56 = create_function("", base64_decode("JGs9MTQ...Hop0w=="));
BTW, I did not know that PHP would execute function names in strings as simply as $fnname(), but it does not surprise me.
What's in the base64 data?
>>> d = max(wprss3[3].split('"'), key=len).decode('base64')
The decoded data is 20k long, and visual inspection shows that the middle is just lots of numbers separated by semicolons. The PHP code is decoding those numbers by XORing them with 143, using them as ASCII codepoints, and evaluating the result. So we want to perform the same decoding to see what source code results:
>>> nums = max(d.split('"'), key=len).split(';')
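The XOR step itself can be sketched like this, with four stand-in numbers (the real file had roughly 20,000 of them):

```python
# The PHP loop XORs each number with 143 and uses the result as an ASCII
# code point; this is the same decoding in Python.
nums = "234;249;238;227".split(";")
source = "".join(chr(int(n) ^ 143) for n in nums if n)
# source is now "eval"
```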
This finally shows us the source of the backdoor which is executed when the page wp-rss3.php is visited in a browser. I've reformatted it here slightly just to break long lines:
error_reporting(E_ERROR | E_WARNING | E_PARSE);
As you can quickly see, this is a nasty piece of work: it takes commands from the client and will execute PHP code, or SQL, or OS shell commands. I don't understand all the back and forth of the forms handling here, but it doesn't matter, it's clearly intended to let a remote attacker have his way on your machine. Bad stuff.
I wonder if a WordPress installation could be checked for malware by looking for files with too high a proportion of base64-encoded text?
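A sketch of that idea, using longest-run length rather than overall proportion (ordinary code is mostly base64-legal characters anyway, so a long unbroken run is a stronger signal; the threshold here is a guess):

```python
import string

# Characters legal in base64 text.
BASE64_CHARS = set(string.ascii_letters + string.digits + "+/=")

def longest_base64_run(text):
    """Length of the longest unbroken run of base64-legal characters."""
    run = best = 0
    for c in text:
        run = run + 1 if c in BASE64_CHARS else 0
        best = max(best, run)
    return best

def looks_suspicious(text, threshold=500):
    # The threshold is a guess: ordinary PHP rarely contains half a
    # kilobyte of unbroken base64 characters, but encoded payloads do.
    return longest_base64_run(text) >= threshold
```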
I told my mom to remove the file, but I suspect there will be more cleaning up to do...
I sat in on a beginner's programming class a few weeks ago, and I was struck by the bizarre words we routinely use, but which must sound like nonsense to beginners.
Take the simple program:
print "Hello, world!"
What is the word "print" doing here? Printing means to produce marks on a piece of paper. There's no paper involved. And "Hello, world!" is a string? It certainly doesn't look like a piece of string.
Expressions have no range of emotion at all, arguments aren't debating anything. Comprehensions are incomprehensible, floats just lie there. You can't put a price on values, dictionaries have no order.
It's no wonder beginners think we're all nuts.
At edX, we have Python behind the scenes in courses to initialize the state of problems presented to students. Often, these problems are randomized so that different students will see different details in quantitative problems, but each student's random seed is saved so that the student will see the same problem if they revisit the page.
The seed is used to seed the random module before executing any chunk of course Python, so that you can simply use the random module and know that you'll get an appropriate value.
Today I found code like this in a course:
My task was to refactor how information flowed around, and the_seed wasn't going to be available, so I asked why the code was like this. It seemed odd, because the random module had just been seeded before this code was invoked, so why had the author bothered to re-seed the module with the same seed?
The answer was that it was a mysterious bug from months ago where the first time the code was run, it would produce a different result than any other time, and the re-seeding solved it. The q import seemed to be messing with the random seed, but only the first time.
The "only first time" clue pointed to it being code that is run on import. Remember, Python modules are just a series of statements, and when you import a module, it really executes all the statements. There's no "import mode" that just collects function definitions. If you write a statement with a side effect at the top level of a module, that side effect will happen when you import the module.
But statements in a module are only executed the first time the module is imported in a process. Subsequent imports simply produce another reference to the existing module object. Everything pointed to a statement running during import which stomped on the random module.
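This is easy to demonstrate with a throwaway module (hypothetical name):

```python
import importlib
import os
import sys
import tempfile

# Create a throwaway module whose top-level statement has a side effect
# we can count.
tmpdir = tempfile.mkdtemp()
with open(os.path.join(tmpdir, "noisy.py"), "w") as f:
    f.write("import sys\n")
    f.write("sys._noisy_imports = getattr(sys, '_noisy_imports', 0) + 1\n")
sys.path.insert(0, tmpdir)

import noisy
import noisy  # the second import is just a lookup in sys.modules

count_after_two_imports = sys._noisy_imports  # the body ran exactly once

importlib.reload(noisy)  # reload, by contrast, does re-execute the body
```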
The q module imported a number of other modules, including numpy and sympy. But why would importing a module re-seed the random module?
A little experimenting showed that sympy was at fault here:
Python 2.7.3 (default, Aug 1 2012, 05:16:07)
Looking at the values, after importing sympy, we've skipped ahead one number in our random sequence. So sympy isn't re-seeding the generator, it's consuming a random number.
To find out where, we resorted to a monkey-patching trick: Replace random.random with a booby-trap:
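A minimal version of the booby-trap, with a stand-in consumer instead of importing sympy:

```python
import random

original = random.random

def booby_trap(*args, **kwargs):
    # Raising here makes the traceback show exactly who consumed a number.
    raise RuntimeError("random.random() called!")

random.random = booby_trap
try:
    random.random()  # in the real hunt, this line was `import sympy`
except RuntimeError as exc:
    message = str(exc)
finally:
    random.random = original  # always restore the real function
```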
Python 2.7.3 (default, Aug 1 2012, 05:16:07)
OK, not sure why it's importing its tests when I try to use the package, but looking at the code, here's the culprit:
Here we can see the problem. Remember that default values for function arguments are computed once, when the function is defined. Since this function is defined when the module is imported, random.random() will be called during import, consuming one of our random numbers.
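The trap is easy to reproduce with illustrative code (this is not the actual sympy source):

```python
import random

random.seed(42)
first_after_seed = random.random()   # the first draw right after seeding

random.seed(42)
def pick(seed=random.random()):      # the default is evaluated NOW, at definition time
    return seed
# Just defining pick() consumed the first draw from the freshly re-seeded sequence.
```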
Better would be to define it like this:
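A sketch of the usual idiom (again illustrative, not sympy's actual code): default the argument to None, and compute the seed inside the function, so nothing is consumed at import time:

```python
import random

def pick(seed=None):
    # With None as the default, random.random() runs only when pick() is
    # actually called, so defining or importing this consumes nothing.
    if seed is None:
        seed = random.random()
    return seed
```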
I'm not quite sure which behavior the author wanted, one seed for all the instances, or one seed per instance. I know I don't want importing this module to change my random number sequence.
Amusingly enough, the behavior of the initializer is irrelevant: the function is only called in one place, and that call never lets the seed argument default:
def test(*paths, **kwargs):
The best solution for our code would be to not rely on the module-level random number sequence, and instead use our own Random object. Come to think of it, that's what sympy should do too.
BTW, looking at why sympy is importing test infrastructure when I import it, there's this in sympy/utilities/__init__.py:
"""Some utilities that may help.
This makes using utilities very convenient, since it contains everything at the top level. But the downside is it means you must always take everything. There is no way to import only part of utilities. Even if you use "from utilities.lambdify import lambdify," Python will execute the utilities/__init__.py file, importing everything.
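The behavior is easy to demonstrate with a throwaway package (hypothetical names standing in for sympy.utilities):

```python
import os
import sys
import tempfile

# Build a throwaway package whose __init__.py leaves a visible side effect.
root = tempfile.mkdtemp()
pkg = os.path.join(root, "utilpkg")
os.mkdir(pkg)
with open(os.path.join(pkg, "__init__.py"), "w") as f:
    f.write("init_ran = True\n")  # stands in for importing everything
with open(os.path.join(pkg, "helper.py"), "w") as f:
    f.write("thing = 42\n")
sys.path.insert(0, root)

# Asking for only one name still executes utilpkg/__init__.py first.
from utilpkg.helper import thing

import utilpkg
```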
Eryksun commented on yesterday's blog post,
Sure enough, that gets you a reference to the built-ins, even in an eval with no builtins. It points up two deficiencies in my searching code from yesterday. First, I was only examining constructed objects, not the classes themselves, but more importantly, I was looking at object's attributes but not the keys in dictionaries!
Here's the updated search code, which also has some nicer uses of generators:
"""Look for builtins..."""
When I run this code, it finds things like,
and many other similar lines. The builtins were right there all along, if you know where to look.
I discovered Floyd's follow-up to my Eval really is dangerous post. He catalogs a few interesting variations. At the end, though, he mentions the difficulty of finding the original builtins on Python 3.
If you remember, in Python 2, we did it like this:
This relies on the fact that warnings.catch_warnings is defined, so we can get it from object's subclasses, and on the fact that that object has a _module attribute which is a module.
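Reconstructed as a sketch, the trick looks roughly like this (the explicit import of warnings is only so the class exists in this process; in a real sandboxed eval, you'd be counting on it already having been loaded):

```python
import warnings  # only to guarantee the class is in object's subclasses here

# Crawl from a harmless tuple literal back to the builtins, with no names needed.
subclasses = ().__class__.__bases__[0].__subclasses__()
cw = [c for c in subclasses if c.__name__ == "catch_warnings"][0]
builtins_ref = cw()._module.__builtins__

# On an imported module, __builtins__ is usually a dict; normalize either way.
names = builtins_ref if isinstance(builtins_ref, dict) else vars(builtins_ref)
```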
Python 3 doesn't seem to have that class defined right off the bat, so we can't count on it for finding the builtins. But, I figured, there must be some other class that would serve the same purpose?
To find out, I tried searching for one. Here's the code I used:
This code iterates all the subclasses of object, and tries a bunch of different constructor arguments to try to make one. If it succeeds, it recursively examines the attributes reachable from the object, looking for an object or dict that has "open" and "__import__".
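A sketch of that kind of search (not the exact script I used; the roots parameter is just a convenience for trying smaller scopes):

```python
def find_builtins(roots=None, max_depth=4, budget=100000):
    """Search objects reachable from object's subclasses for the builtins."""
    seen = set()
    remaining = budget

    def is_builtins(obj):
        return isinstance(obj, dict) and "open" in obj and "__import__" in obj

    def walk(obj, depth):
        nonlocal remaining
        if depth > max_depth or id(obj) in seen or remaining <= 0:
            return None
        seen.add(id(obj))
        remaining -= 1
        if is_builtins(obj):
            return obj
        if isinstance(obj, dict):
            for value in list(obj.values()):
                hit = walk(value, depth + 1)
                if hit is not None:
                    return hit
            return None
        try:
            attr_names = dir(obj)
        except Exception:
            return None
        for attr_name in attr_names:
            try:
                attr = getattr(obj, attr_name)
            except Exception:
                continue
            hit = walk(attr, depth + 1)
            if hit is not None:
                return hit
        return None

    for cls in (roots if roots is not None else object.__subclasses__()):
        try:
            obj = cls()  # try a no-argument constructor
        except Exception:
            continue
        hit = walk(obj, 0)
        if hit is not None:
            return cls.__name__, hit
    return None
```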
Running this on Python 3.3 sure enough doesn't find anything like builtins, after examining 20k objects. And running it on Python 2.7 finds only the catch_warnings object we had before.
I wouldn't have guessed it was so unusual for an object to hold a reference to a module. Am I overlooking an important principle, or is this just not something people do?