Python’s pickle module is a very convenient way to serialize and de-serialize objects. It needs no schema, and can handle arbitrary Python objects. But it has problems. This post briefly explains the problems.
Some people will tell you to never use pickle because it’s bad. I won’t go that far. I’ll say, only use pickle if you are OK with its nine flaws:
- Insecure
- Old pickles look like old code
- Implicit
- Over-serializes
- __init__ isn’t called
- Python only
- Unreadable
- Appears to pickle code
- Slow
The flaws
Here is a brief explanation of each flaw, in roughly the order of importance.
Insecure
Pickles can be hand-crafted that will have malicious effects when you unpickle them. As a result, you should never unpickle data that you do not trust.
The insecurity is not because pickles contain code, but because they create objects by calling constructors named in the pickle. Any callable can be used in place of your class name to construct objects. Malicious pickles will use other Python callables as the “constructors.” For example, instead of executing “models.MyObject(17)”, a dangerous pickle might execute “os.system('rm -rf /')”. The unpickler can’t tell the difference between “models.MyObject” and “os.system”. Both are names it can resolve, producing something it can call. The unpickler executes either of them as directed by the pickle.
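Here is a minimal sketch of that mechanism with a harmless stand-in. The Sneaky class is hypothetical: its __reduce__ method names the builtin eval as the “constructor.” A real attack would name something like os.system instead.

```python
import pickle


class Sneaky:
    """Hypothetical class whose pickle names an arbitrary callable."""

    def __reduce__(self):
        # Tell pickle: "to rebuild me, call eval('6 * 7')".
        # An attacker would put os.system and a shell command here.
        return (eval, ("6 * 7",))


payload = pickle.dumps(Sneaky())

# The unpickler resolves the name builtins.eval and calls it as directed.
result = pickle.loads(payload)
print(result)  # 42, not a Sneaky instance
```

Nothing about Sneaky itself ends up in the pickle. Only the callable’s name and its arguments are stored, which is all the unpickler needs.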
More details, including an example, are in Supakeen’s post Dangers in Python’s standard library.
Old pickles look like old code
Because pickles store the structure of your objects, when they are unpickled, they have the same structure as when you pickled them. This sounds like a good thing and is exactly what pickle is designed to do. But if your code changes between the time you made the pickle and the time you used it, your objects may not correspond to your code. The objects will still have the structure created by the old code, but they will be running with the new code.
For example, if you’ve added an attribute since the pickle was made, the objects from the pickle won’t have that attribute. Your code will be expecting the attribute, causing problems.
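A small sketch of the problem, assuming a hypothetical User class. The code change is simulated by redefining the class in the same file after the pickle is made.

```python
import pickle


class User:
    def __init__(self, name):
        self.name = name


old_pickle = pickle.dumps(User("ada"))


# Simulate a later release: the class now also has an email attribute.
class User:
    def __init__(self, name, email=None):
        self.name = name
        self.email = email


restored = pickle.loads(old_pickle)
print(type(restored) is User)      # True: it is an instance of the new class
print(hasattr(restored, "email"))  # False: it only has the old structure
```

Any new code that reads restored.email will fail with an AttributeError, even though every freshly constructed User has that attribute.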
Implicit
The great convenience of pickles is that they will serialize whatever structure your object has. There’s no extra work to create a serialization structure. But that brings problems of its own. Do you really want your datetimes serialized as datetimes? Or as ISO 8601 strings? You don’t have a choice: they will be datetimes.
Not only don’t you have to specify the serialization form, you can’t specify it.
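To make the contrast concrete, here is a short sketch comparing pickle with JSON for one datetime. The dict key and the choice of isoformat() are just illustrative assumptions.

```python
import json
import pickle
from datetime import datetime

stamp = datetime(2020, 5, 1, 12, 30)

# Pickle: no decision to make, and no decision you can make.
restored = pickle.loads(pickle.dumps({"when": stamp}))
print(type(restored["when"]).__name__)  # datetime

# JSON: you have to pick a representation, here an ISO 8601 string.
as_json = json.dumps({"when": stamp.isoformat()})
print(as_json)  # {"when": "2020-05-01T12:30:00"}
```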
Over-serializes
Pickles are implicit: they serialize everything in your objects, even data you didn’t want to serialize. For example, you might have an attribute that is a cache of computation that you don’t want serialized. Pickle doesn’t have a convenient way to skip that attribute.
Worse, if your object contains an attribute that can’t be pickled, like an open file object, pickle won’t skip it. It will insist on trying to pickle it, and then raise an exception.
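Both behaviors show up in a few lines. This is a sketch with a hypothetical Report class that has a cache attribute and, later, an open file attached.

```python
import os
import pickle


class Report:
    def __init__(self, rows):
        self.rows = rows
        self._total_cache = None  # derived data we would rather not store

    def total(self):
        if self._total_cache is None:
            self._total_cache = sum(self.rows)
        return self._total_cache


report = Report([1, 2, 3])
report.total()  # fills the cache

restored = pickle.loads(pickle.dumps(report))
print(restored._total_cache)  # 6: the cache was pickled along with the rows

# One unpicklable attribute makes the whole object unpicklable.
with open(os.devnull, "w") as handle:
    report.log = handle
    try:
        pickle.dumps(report)
    except TypeError as exc:
        error = exc
        print("pickling failed:", type(exc).__name__)
```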
__init__ isn’t called
Pickles store the entire structure of your objects. When the pickle module recreates your objects, it does not call your __init__ method, since the object has already been created.
This can be surprising, since almost nowhere else do objects come into being without calling __init__. The logic here is that __init__ was already called when the object was first created in the process that made the pickle.
But your __init__ method might perform some essential work, like opening file objects. Your unpickled objects will be in a state that is inconsistent with your __init__ method. Or your __init__ might log information about the object being created. Unpickled objects won’t appear in the log.
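You can watch this happen by counting __init__ calls. The Connection class and its counter are hypothetical, just enough to show the effect.

```python
import pickle


class Connection:
    created = 0  # how many times __init__ has run

    def __init__(self, host):
        Connection.created += 1
        self.host = host


original = Connection("db.example")
clone = pickle.loads(pickle.dumps(original))

print(clone.host)          # db.example: the attributes were restored
print(Connection.created)  # 1: __init__ ran for original only
```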
Python only
Pickles are specific to Python, and are only usable by other Python programs. This isn’t strictly true: you can find packages for other languages that can use pickles, but they are rare. They will naturally be limited to the cross-language generic list/dict object structures, at which point you might as well just use JSON.
Unreadable
A pickle is a binary data stream (actually, instructions for an abstract execution engine). If you open a pickle as a plain file, you cannot read its contents. Short of disassembling it with the pickletools module, the only way to know what is in a pickle is to use the pickle module to load it. This can make debugging difficult, since you might not be able to search your pickle files for data you are interested in:
>>> pickle.dumps([123, 456])
b'\x80\x03]q\x00(K{M\xc8\x01e.'
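The standard library’s pickletools module can at least disassemble those instructions without running them. A minimal sketch follows; the exact listing depends on the pickle protocol your Python version uses, so no output is shown.

```python
import io
import pickle
import pickletools

data = pickle.dumps([123, 456])

# Write the opcode listing to a string instead of stdout.
out = io.StringIO()
pickletools.dis(data, out=out)
listing = out.getvalue()

print(listing)
```

This is a debugging aid, not a fix: you still can’t grep a directory of pickle files for a value.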
Appears to pickle code
Functions and classes are first-class objects in Python: you can store them in lists, dicts, attributes, and so on. Pickle will gladly serialize objects that contain callables like functions and classes. But it doesn’t store the code in the pickle, just the name of the function or class.
Pickles are not a way to move or store code, though they can appear to be. When you unpickle your data, the names of the functions are used to find existing code in your running process.
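A quick sketch of what actually gets stored, using json.dumps as an arbitrary example of a function to pickle:

```python
import json
import pickle

# Pickling a function stores its module and name, not its code.
payload = pickle.dumps(json.dumps)
print(payload)  # a few dozen bytes that include b"json" and b"dumps"

# Unpickling looks the name up in the running process.
restored = pickle.loads(payload)
print(restored is json.dumps)  # True: the very same function object

# A function with no importable name can't be pickled at all.
try:
    pickle.dumps(lambda x: x + 1)
    lambda_pickled = True
except (pickle.PicklingError, AttributeError):
    lambda_pickled = False
print(lambda_pickled)  # False
```

If the process doing the unpickling has a different json.dumps, or none at all, that is the one you get, or you get an error.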
Slow
Compared to other serialization techniques, pickle can be slow, as Ben Frederickson demonstrates in Don’t pickle your data.
But but...
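If you want to check this against your own data, a rough timing sketch like the one below is enough to start with. The sample records are made up, and the numbers will vary with the data shape, the pickle protocol, and the Python version, so don’t read anything into a single run.

```python
import json
import pickle
import timeit

# Hypothetical sample data: a list of small dicts.
records = [{"id": i, "name": f"user{i}", "score": i * 0.5} for i in range(1000)]

# Time a full round trip (dump, then load) for each format.
pickle_seconds = timeit.timeit(
    lambda: pickle.loads(pickle.dumps(records)), number=50
)
json_seconds = timeit.timeit(
    lambda: json.loads(json.dumps(records)), number=50
)

print(f"pickle: {pickle_seconds:.4f}s  json: {json_seconds:.4f}s")
```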
Some of these problems can be addressed by adding special methods to your class, like __getstate__ or __reduce__. But once you start down that path, you might as well use another serialization method that doesn’t have these flaws to begin with.
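For the record, here is roughly what starting down that path looks like. This sketch uses a hypothetical Report class with __getstate__ and __setstate__ to keep a cache attribute out of the pickle.

```python
import pickle


class Report:
    def __init__(self, rows):
        self.rows = rows
        self._total_cache = None

    def total(self):
        if self._total_cache is None:
            self._total_cache = sum(self.rows)
        return self._total_cache

    def __getstate__(self):
        # Copy the instance dict and drop what shouldn't be stored.
        state = self.__dict__.copy()
        state.pop("_total_cache", None)
        return state

    def __setstate__(self, state):
        # __init__ is not called on unpickling, so reset the cache here.
        self.__dict__.update(state)
        self._total_cache = None


report = Report([1, 2, 3])
report.total()  # fills the cache on the original

restored = pickle.loads(pickle.dumps(report))
print(restored.rows)          # [1, 2, 3]
print(restored._total_cache)  # None: the cache was left out
```

It works, but you are now hand-writing the serialized form of your class, which is the work pickle was supposed to save you.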
What’s better?
There are lots of other ways to serialize objects, ranging from plain-old JSON to fancier alternatives like marshmallow, cattrs, protocol buffers, and more.
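As a baseline for comparison, here is a sketch of the plain-old JSON end of that range, using only the standard library. The Event class and its to_json and from_json methods are hypothetical names for illustration.

```python
import json
from dataclasses import dataclass
from datetime import date


@dataclass
class Event:
    name: str
    day: date

    def to_json(self):
        # The serialized form is spelled out explicitly.
        return json.dumps({"name": self.name, "day": self.day.isoformat()})

    @classmethod
    def from_json(cls, text):
        raw = json.loads(text)
        # Going through the constructor means __init__ runs as usual.
        return cls(name=raw["name"], day=date.fromisoformat(raw["day"]))


event = Event("launch", date(2021, 3, 4))
text = event.to_json()
print(text)  # {"name": "launch", "day": "2021-03-04"}
print(Event.from_json(text) == event)  # True
```

It’s more typing than pickle, but the output is readable, usable from other languages, and contains only what you chose to put in it.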
I don’t have a strong recommendation for any one of these. The right answer will depend on the particulars of your problem. It might even be pickle...
Comments
Most other methods require some kind of schema declarations and processing (I’ve compared some in https://ict.swisscom.ch/2017/12/python-schema/; nowadays I would probably consider Pydantic).
In that sense, pickling corresponds to dynamic, type-less Python, whereas other methods correspond to typed Python.
https://link.medium.com/Z3hOpWCvC7
https://voidfiles.github.io/python-serialization-benchmark/
I believe pickle got a reputation for bad performance because in Python 2, some people didn't realize that you had to use cPickle and explicitly specify the protocol, as otherwise the worst-performing protocol would be used.
Great writeup otherwise, I have a much clearer idea of what (not) to expect from pickle after reading this!
https://arrow.apache.org/docs/python/ipc.html
The problem is less likely when using pickletools.optimize() but is still a big gotcha.
The best solution for equal byte representations of equal objects is to use a different serialization protocol, like JSON.
I think I had a good use-case for pickle. I was writing a script that had a long computation phase (several minutes), followed by a graphing phase to analyze the data and to render fancy graphics.
I finished the code for the first phase first, and I was iterating on variants of the second phase. To save time, I used pickle as a temporary serialization. It let me store the exact data structures to later be reused on future interpreter executions. After the code for the second phase was completed, both the pickled files and the pickle-related code became obsolete.
Pickle is good as a temporary short-term data storage across multiple executions of the same-ish source-code on the same interpreter version. It’s very unlikely to be a good solution for a production environment. It’s definitely not suitable for any long-term storage.