|Ned Batchelder : Blog | Code | Text | Site|
The structure of .pyc files
» Home : Blog : April 2008
I spent some time digging around in the Python code to understand how .pyc files work. It turns out they are fairly simple, then kind of complex.
At the simple level, a .pyc file is a binary file containing only three things:
The magic number is nothing as cool as cafebabe, it's simply two bytes that change with each change to the marshalling code, and then two bytes of 0d0a. The 0d0a bytes are a carriage return and line feed, so that if a .pyc file is processed as text, it will change, and the magic number will be corrupted. This will keep the file from executing after a copy corruption. The marshalling code is tweaked in every major release of Python, so in practice the magic number is unique in each version of the Python interpreter. For Python 2.5, it's b3f20d0a. (the gory details are in import.c)
The four-byte modification timestamp is the Unix modification timestamp of the source file that generated the .pyc, so that it can be recompiled if the source changes.
The entire rest of the file is just the output of marshal.dump of the code object that results from compiling the source file. Marshal is like pickle, in that it serializes Python objects. It has different goals than pickle, though. Where pickle is meant to produce version-independent serialization suitable for persistence, marshal is meant for short-lived serialized objects, so its representation can change with each Python version. Also, pickle is designed to work properly for user-defined types, while marshal handles the complexities of Python internal types. The one we care about in particular here is the code object.
The nature of marshalling gives us the important characteristics of .pyc files: they are independent of platform, but very sensitive to Python versions. A 2.4 .pyc file will not execute under 2.5, but it can be copied from one operating system to another just fine.
So that's the simple part: two longs and a marshalled code object. The complexity, of course, is in the structure of the code object. They contain all sorts of information produced by the compiler, the meatiest of which is the bytecode itself.
Running this on the .pyc from an ultra-simple Python file:
A lot of this stuff I don't understand, but the byte codes are nicely disassembled and presented symbolically. The Python virtual machine is a stack-oriented interpreter, so a lot of the operations are loads and pops, and of course jumps and conditionals. For the adventurous: the byte-code interpreter is in ceval.c. The exact details of the byte codes change with each major version of Python. For example, the PRINT_ITEM and PRINT_NEWLINE opcodes we see here are gone in Python 3.0.
In the disassembled output, the left-most numbers (1, 2, 3) are the line numbers in the original source file and the next numbers (0, 3, 6, 9, ...) are the byte offsets of the instruction. The operands to the instruction are presented numerically, and then in parentheses, interpreted symbolically. Lines with ">>" are the targets of jump instructions somewhere else in the code.
This sample was very simple, with a single code object for the flow of instructions in the module. A real module with class and function definitions would be more complicated. The classes and functions would themselves be code objects in the consts list, nested as deeply as needed to represent the module. The module code object has class code objects which themselves have function code objects, and so on.
Once you start digging around at this level, there are all sorts of facilities for working with code objects. In the standard library, there's the compile built-in function, and the compiler, codeop and opcode modules. For the truly adventurous, there are third-party packages like codewalk, byteplay and bytecodehacks. PEP 339 gives more detail about compilation and opcodes. Finally, Ananth Shrinivas had another take on exploring Python bytecode.
tagged: python» 11 reactions