I spent some time digging around in the Python code to understand how .pyc files work. It turns out they are fairly simple, then kind of complex.

At the simple level, a .pyc file is a binary file containing only three things:

  • A four-byte magic number,
  • A four-byte modification timestamp, and
  • A marshalled code object.

The magic number is nothing as cool as cafebabe: it's simply two bytes that change with each change to the marshalling code, followed by the two bytes 0d0a. The 0d0a bytes are a carriage return and line feed, so if a .pyc file is processed as text, those bytes will change and the magic number will be corrupted. This keeps the file from executing after a corrupting copy. The marshalling code is tweaked in every major release of Python, so in practice the magic number is unique to each version of the Python interpreter. For Python 2.5, it's b3f20d0a. (The gory details are in import.c.)
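To make the CRLF trick concrete, here is a minimal sketch in Python 3 syntax (the listing later in this post is Python 2). The helper name text_mode_mangle is made up for this illustration, and it only simulates what a CRLF-to-LF text conversion would do to the four header bytes. importlib.util.MAGIC_NUMBER is the magic number of whichever interpreter runs the sketch, so its value will differ from the 2.5 one.

```python
import importlib.util

def text_mode_mangle(data):
    """Simulate a text-mode copy that rewrites CRLF line endings as LF."""
    return data.replace(b"\r\n", b"\n")

py25_magic = bytes.fromhex("b3f20d0a")      # the Python 2.5 value quoted above
print(py25_magic[2:] == b"\r\n")            # True: the last two bytes are CR LF
print(text_mode_mangle(py25_magic).hex())   # b3f20a: only three bytes survive

# The interpreter running this sketch has its own magic number.
print(importlib.util.MAGIC_NUMBER.hex())
```

After the simulated text conversion the header is one byte short, so it can no longer match any interpreter's magic number.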

The four-byte modification timestamp is the Unix modification timestamp of the source file that generated the .pyc, so that it can be recompiled if the source changes.
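As a rough illustration, the four bytes can be decoded as a little-endian unsigned 32-bit integer. The sketch below is Python 3 syntax and uses the bytes 8a9efc47 from the sample dump later in this post. The function names are hypothetical, and source_changed is only my approximation of the staleness check: it assumes the interpreter treats any mismatch between the stored value and the source file's mtime as "needs recompiling".

```python
import os
import struct
from datetime import datetime, timezone

def decode_pyc_mtime(raw):
    """Decode the four timestamp bytes as a little-endian unsigned 32-bit int."""
    return struct.unpack("<L", raw)[0]

def source_changed(source_path, pyc_mtime):
    """Hypothetical staleness check: does the source mtime differ from the stored one?"""
    return int(os.stat(source_path).st_mtime) != pyc_mtime

stamp = decode_pyc_mtime(bytes.fromhex("8a9efc47"))
print(stamp)                                            # 1207737994
print(datetime.fromtimestamp(stamp, tz=timezone.utc))   # 2008-04-09 10:46:34+00:00
```

The UTC time printed here is four hours ahead of the local time shown in the sample dump, which is consistent with that dump having been produced in a UTC-4 timezone.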

The entire rest of the file is just the output of marshal.dump of the code object that results from compiling the source file. Marshal is like pickle, in that it serializes Python objects. It has different goals than pickle, though. Where pickle is meant to produce version-independent serialization suitable for persistence, marshal is meant for short-lived serialized objects, so its representation can change with each Python version. Also, pickle is designed to work properly for user-defined types, while marshal handles the complexities of Python internal types. The one we care about in particular here is the code object.
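A small round trip shows the idea. This is a sketch in Python 3 syntax, not the code the interpreter itself runs: it compiles a one-line source string, marshals the resulting code object to bytes (the same kind of payload that follows the header in a .pyc), then loads it back and executes it. The filename "<sample>" is a placeholder.

```python
import marshal

source = "a, b = 1, 0\n"
code = compile(source, "<sample>", "exec")   # a module-level code object

blob = marshal.dumps(code)                   # bytes, like the tail of a .pyc
restored = marshal.loads(blob)               # back to a code object

namespace = {}
exec(restored, namespace)                    # run the revived code object
print(namespace["a"], namespace["b"])        # 1 0
```

Because the marshal format can change between Python versions, a blob like this should only be loaded by the same version that produced it.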

The nature of marshalling gives us the important characteristics of .pyc files: they are independent of platform, but very sensitive to Python versions. A 2.4 .pyc file will not execute under 2.5, but it can be copied from one operating system to another just fine.

So that's the simple part: two longs and a marshalled code object. The complexity, of course, is in the structure of the code object. Code objects contain all sorts of information produced by the compiler, the meatiest of which is the bytecode itself.

Luckily it isn't hard to write a program to dump these things out, thanks to the marshal and dis modules:

import dis, marshal, struct, sys, time, types

def show_file(fname):
    f = open(fname, "rb")
    magic = f.read(4)
    moddate = f.read(4)
    # '<L' reads exactly four little-endian bytes; a bare 'L' is eight bytes on many 64-bit builds
    modtime = time.asctime(time.localtime(struct.unpack('<L', moddate)[0]))
    print "magic %s" % (magic.encode('hex'))
    print "moddate %s (%s)" % (moddate.encode('hex'), modtime)
    code = marshal.load(f)
    show_code(code)
    
def show_code(code, indent=''):
    print "%scode" % indent
    indent += '   '
    print "%sargcount %d" % (indent, code.co_argcount)
    print "%snlocals %d" % (indent, code.co_nlocals)
    print "%sstacksize %d" % (indent, code.co_stacksize)
    print "%sflags %04x" % (indent, code.co_flags)
    show_hex("code", code.co_code, indent=indent)
    dis.disassemble(code)
    print "%sconsts" % indent
    for const in code.co_consts:
        if type(const) == types.CodeType:
            show_code(const, indent+'   ')
        else:
            print "   %s%r" % (indent, const)
    print "%snames %r" % (indent, code.co_names)
    print "%svarnames %r" % (indent, code.co_varnames)
    print "%sfreevars %r" % (indent, code.co_freevars)
    print "%scellvars %r" % (indent, code.co_cellvars)
    print "%sfilename %r" % (indent, code.co_filename)
    print "%sname %r" % (indent, code.co_name)
    print "%sfirstlineno %d" % (indent, code.co_firstlineno)
    show_hex("lnotab", code.co_lnotab, indent=indent)
    
def show_hex(label, h, indent):
    h = h.encode('hex')
    if len(h) < 60:
        print "%s%s %s" % (indent, label, h)
    else:
        print "%s%s" % (indent, label)
        for i in range(0, len(h), 60):
            print "%s   %s" % (indent, h[i:i+60])

show_file(sys.argv[1])

Running this on the .pyc from an ultra-simple Python file:

a, b = 1, 0
if a or b:
    print "Hello", a

produces this:

magic b3f20d0a
moddate 8a9efc47 (Wed Apr 09 06:46:34 2008)
code
   argcount 0
   nlocals 0
   stacksize 2
   flags 0040
   code
      6404005c02005a00005a0100650000700700016501006f0d000164020047
      65000047486e01000164030053
  1           0 LOAD_CONST               4 ((1, 0))
              3 UNPACK_SEQUENCE          2
              6 STORE_NAME               0 (a)
              9 STORE_NAME               1 (b)

  2          12 LOAD_NAME                0 (a)
             15 JUMP_IF_TRUE             7 (to 25)
             18 POP_TOP
             19 LOAD_NAME                1 (b)
             22 JUMP_IF_FALSE           13 (to 38)
        >>   25 POP_TOP

  3          26 LOAD_CONST               2 ('Hello')
             29 PRINT_ITEM
             30 LOAD_NAME                0 (a)
             33 PRINT_ITEM
             34 PRINT_NEWLINE
             35 JUMP_FORWARD             1 (to 39)
        >>   38 POP_TOP
        >>   39 LOAD_CONST               3 (None)
             42 RETURN_VALUE
   consts
      1
      0
      'Hello'
      None
      (1, 0)
   names ('a', 'b')
   varnames ()
   freevars ()
   cellvars ()
   filename 'C:\\ned\\sample.py'
   name '<module>'
   firstlineno 1
   lnotab 0c010e01

A lot of this stuff I don't understand, but the byte codes are nicely disassembled and presented symbolically. The Python virtual machine is a stack-oriented interpreter, so a lot of the operations are loads and pops, and of course jumps and conditionals. For the adventurous: the byte-code interpreter is in ceval.c. The exact details of the byte codes change with each major version of Python. For example, the PRINT_ITEM and PRINT_NEWLINE opcodes we see here are gone in Python 3.0.
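For anyone who wants to poke at instructions programmatically rather than read the printed listing, the dis module in Python 3 also exposes them as objects. The sketch below is Python 3 syntax, so the sample uses print as a function and the PRINT_ITEM opcodes will not appear. The exact opcode names and offsets depend on the interpreter version, so no expected output is shown.

```python
import dis

source = 'a, b = 1, 0\nif a or b:\n    print("Hello", a)\n'
code = compile(source, "<sample>", "exec")

# Each instruction carries its byte offset, symbolic name and decoded operand.
rows = [(ins.offset, ins.opname, ins.argrepr) for ins in dis.get_instructions(code)]
for offset, opname, argrepr in rows:
    print("%4d %-24s %s" % (offset, opname, argrepr))
```

Scanning the opname column makes the stack-machine flavor easy to see: most entries are loads, stores, pops and jumps.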

In the disassembled output, the left-most numbers (1, 2, 3) are the line numbers in the original source file and the next numbers (0, 3, 6, 9, ...) are the byte offsets of the instruction. The operands to the instruction are presented numerically, and then in parentheses, interpreted symbolically. Lines with ">>" are the targets of jump instructions somewhere else in the code.
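The mapping between line numbers and byte offsets comes from the lnotab field at the bottom of the dump. Here is a minimal sketch of decoding it, written in Python 3 syntax and assuming the Python 2.x layout: pairs of unsigned bytes, each holding an offset increment and a line increment. The function name decode_lnotab is made up, and the sketch does not merge the split entries that appear when an increment is too large for one byte.

```python
def decode_lnotab(lnotab, first_line):
    """Yield (byte_offset, line_number) pairs from an old-style lnotab.

    Assumes pairs of unsigned bytes: (offset increment, line increment).
    first_line is the code object's co_firstlineno.
    """
    offset, line = 0, first_line
    yield offset, line
    for i in range(0, len(lnotab), 2):
        offset += lnotab[i]
        line += lnotab[i + 1]
        yield offset, line

# lnotab and firstlineno taken from the sample dump above.
table = list(decode_lnotab(bytes.fromhex("0c010e01"), 1))
print(table)   # [(0, 1), (12, 2), (26, 3)]
```

The result matches the disassembly: line 2 starts at offset 12 and line 3 at offset 26.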

This sample was very simple, with a single code object for the flow of instructions in the module. A real module with class and function definitions would be more complicated. The classes and functions would themselves be code objects in the consts list, nested as deeply as needed to represent the module. The module code object has class code objects which themselves have function code objects, and so on.
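The nesting is easy to see with a short recursive walk over co_consts. This is a sketch in Python 3 syntax; walk_code and the Greeter sample are names invented for the illustration.

```python
import types

def walk_code(code, depth=0):
    """Yield (depth, name) for a code object and every code object nested in its consts."""
    yield depth, code.co_name
    for const in code.co_consts:
        if isinstance(const, types.CodeType):
            yield from walk_code(const, depth + 1)

source = "class Greeter:\n    def greet(self):\n        return 'hi'\n"
module_code = compile(source, "<sample>", "exec")
for depth, name in walk_code(module_code):
    print("   " * depth + name)
```

This prints the module, then the class body, then the method, each indented one level deeper than its parent.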

Once you start digging around at this level, there are all sorts of facilities for working with code objects. In the standard library, there's the compile built-in function, and the compiler, codeop and opcode modules. For the truly adventurous, there are third-party packages like codewalk, byteplay and bytecodehacks. PEP 339 gives more detail about compilation and opcodes. Finally, Ananth Shrinivas had another take on exploring Python bytecode.
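The script in this post targets Python 2. For readers on a current interpreter, here is a hedged sketch of reading the header in Python 3 syntax. It assumes the 16-byte layout used since Python 3.7 (PEP 552): magic, a flags word, then the source mtime and source size for timestamp-based files. It does not handle hash-based .pyc files, and read_pyc and the temporary file names are my own choices for the example.

```python
import importlib.util
import marshal
import os
import py_compile
import struct
import tempfile

def read_pyc(path):
    """Return (magic, flags, mtime, source_size, code) from a timestamp-based .pyc.

    Assumes the 16-byte header used since Python 3.7 (PEP 552).
    """
    with open(path, "rb") as f:
        magic = f.read(4)
        flags, mtime, size = struct.unpack("<LLL", f.read(12))
        code = marshal.load(f)
    return magic, flags, mtime, size, code

source = b'a, b = 1, 0\nif a or b:\n    print("Hello", a)\n'

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "sample.py")
    pyc = os.path.join(tmp, "sample.pyc")
    with open(src, "wb") as f:
        f.write(source)
    # Force a timestamp-based .pyc so the header layout is predictable.
    py_compile.compile(src, cfile=pyc, doraise=True,
                       invalidation_mode=py_compile.PycInvalidationMode.TIMESTAMP)
    magic, flags, mtime, size, code = read_pyc(pyc)

print(magic == importlib.util.MAGIC_NUMBER, flags, size, code.co_name)
```

The returned code object can then be explored with the same attribute names used in show_code, though some fields differ between Python 2 and Python 3.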


Comments

Greg 11:42 AM on 9 Apr 2008

This is really cool. So cool, in fact, that I made it into an online utility: PYC Xray.

I hope you don't mind.

localhost 12:04 PM on 9 Apr 2008

informative & concise, great post!

Jason Harris 2:31 PM on 9 Apr 2008

Try 'i' for the unpack format on a 64-bit machine if you are having trouble:
moddate = f.read(4)
modtime = time.asctime(time.localtime(struct.unpack('i', moddate)[0]))

Trent Mick 2:13 AM on 11 Apr 2008

(Mostly off-topic, but I thought interesting.)

Regarding the 0xcafebabe magic number for Java class data: when writing filetype detection code for Komodo's "Replace in Files" (http://svn.openkomodo.com/openkomodo/checkout/openkomodo/trunk/src/python-sitelib/textinfo.py), I ran into the fact that the magic number for Mach-O universal binary data is also "0xcafebabe".

$ uname
Darwin
$ od -H -N 4 /System/Library/Java/com/apple/net/URLFilter.class
0000000 bebafeca
0000004
$ od -H -N 4 /usr/bin/python
0000000 bebafeca
0000004

From /usr/share/file/magic (i.e. the data that `file` uses to guess filetype):

# Since Java bytecode and Mach-O universal binaries have the same magic number
# the test must be preformed in the same "magic" sequence to get both right.
# The long at offset 4 in a universal binary tells the number of architectures.
# The short at offset 4 in a Java bytecode file is the compiler minor version
# and the short at offset 6 is the compiler major version. Since there are
# only 18 labeled Mach-O architectures at current, and the first released
# Java class format was version 43.0, we can safely choose any number
# between 18 and 39 to test the number of architectures against
# (and use as a hack).

zeo 3:14 PM on 10 Jun 2008

Hi!
Is there no downloadable decompyle / depython application available on the web (one that at least handles Python 2.5), only these expensive services?

dogeen 3:21 AM on 25 Mar 2010

Thanks a lot for the article.. This is exactly what I need to rewrite the bytecode in pyc files.
But I may actually write a decompy/depython using your technique and make it open-source :)..

maybe those who are trying to make money from decompy will learn >:)

Eli 1:51 AM on 22 Sep 2010

Nice post, Ned.

One suggestion: in show_code the indented prints are very repetitive (%s and then passing indent every time). It seems better to encapsulate this into an internal function defined inside show_code that prints its args with the correct offset.

Fernando 7:08 PM on 21 Sep 2011

Great code. I had to change struct.unpack('L', moddate)[0] to struct.unpack('<L', moddate)[0] to get it working on 64-bit Python. I think '<L' works on x86 too.

Anders 3:33 AM on 27 Jan 2012

Great work!

(I used the adaptation at http://code.activestate.com/recipes/577880-inspect-a-pyc-file/ and it made me see in the dark. =)

Andreas 6:55 PM on 11 Jul 2013

On OS X you can decode .pyc files now automatically and visualize the contents with Synalyze It!.

Ian 8:49 AM on 24 Nov 2013

The format has changed slightly as of Python 3.3+, so the algorithm above no longer works. In addition to the two original four-byte fields there is a new four-byte field that encodes the size of the source file as a long. Consequently the marshaled code object now begins at position 12.
