2500

Sunday 28 June 2020

This is the 2500th post on this blog. That’s a lot of writing. I estimate this site has about 480,000 words in total, enough for five books.

I’ve been writing here for more than 18 years. The pace is different than when I started: last year I wrote 33 posts. Compare that to 2003, when I wrote more than ten times as many: 441 posts! Twitter has siphoned off some of the short-post energy, but also interests shift over time.

Writing is a good way to understand things, and to learn things. People mostly think of writing as a way to teach and explain, and I am glad when my posts can do that. But I also really value the feedback loop of learning as I explain, and the deeper understanding I find when I teach.

Here’s a common piece of advice from people who create things: to make better things, make more things. Not only does it give you constant practice at making things, but it gives you more chances at lucking into making a good thing.

These days I set myself a goal of writing two posts a month. I find the goal helpful. It prods me to dig for topics. Some will be duds, but sometimes an apparently boring idea will turn out well.

I can’t promise everything (or anything!) will be interesting or insightful. But I’ll keep writing here. Thanks for reading.

Pickle’s nine flaws

Saturday 20 June 2020

Python’s pickle module is a very convenient way to serialize and de-serialize objects. It needs no schema, and can handle arbitrary Python objects. But it has problems. This post briefly explains the problems.

Some people will tell you to never use pickle because it’s bad. I won’t go that far. I’ll say, only use pickle if you are OK with its nine flaws:

  • Insecure
  • Old pickles look like old code
  • Implicit
  • Over-serializes
  • __init__ isn’t called
  • Python only
  • Unreadable
  • Appears to pickle code
  • Slow

The flaws

Here is a brief explanation of each flaw, in rough order of importance.

Insecure

Pickles can be hand-crafted that will have malicious effects when you unpickle them. As a result, you should never unpickle data that you do not trust.

The insecurity is not because pickles contain code, but because they create objects by calling constructors named in the pickle. Any callable can be used in place of your class name to construct objects. Malicious pickles will use other Python callables as the “constructors.” For example, instead of executing “models.MyObject(17)”, a dangerous pickle might execute “os.system(‘rm -rf /’)”. The unpickler can’t tell the difference between “models.MyObject” and “os.system”. Both are names it can resolve, producing something it can call. The unpickler executes either of them as directed by the pickle.

More details, including an example, are in Supakeen’s post Dangers in Python’s standard library.
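
To make the mechanism concrete, here is a minimal and deliberately harmless sketch. The Sneaky class is hypothetical, and the built-in len stands in for a dangerous callable like os.system. Its __reduce__ method tells pickle which callable to name as the "constructor":

```python
import pickle


class Sneaky:
    """Hypothetical class whose pickle names an arbitrary callable."""

    def __reduce__(self):
        # Instruct the unpickler: "to rebuild me, call len('abc')".
        # len is a harmless stand-in for something like os.system.
        return (len, ("abc",))


data = pickle.dumps(Sneaky())

# The unpickler resolves the name and calls it. It never checks that
# the callable is a class, or that it has anything to do with Sneaky.
result = pickle.loads(data)
print(result)                      # 3
print(isinstance(result, Sneaky))  # False
```

Nothing in the loading code mentions len: the pickle data alone decided what got called.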

Old pickles look like old code

Because pickles store the structure of your objects, when they are unpickled, they have the same structure as when you pickled them. This sounds like a good thing and is exactly what pickle is designed to do. But if your code changes between the time you made the pickle and the time you used it, your objects may not correspond to your code. The objects will still have the structure created by the old code, but they will be running with the new code.

For example, if you’ve added an attribute since the pickle was made, the objects from the pickle won’t have that attribute. Your code will be expecting the attribute, causing problems.
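
Here is a small sketch of that situation, squeezed into one script. The Point class and its label attribute are made up for illustration, and redefining the class stands in for a code change between releases:

```python
import pickle


class Point:
    """Version 1 of a hypothetical class."""

    def __init__(self, x, y):
        self.x = x
        self.y = y


old_pickle = pickle.dumps(Point(1, 2))


class Point:  # Version 2: the code has changed since the pickle was made.
    def __init__(self, x, y, label="origin"):
        self.x = x
        self.y = y
        self.label = label


# The pickle only records the name "Point", so it is rebuilt with the
# new class, but with the attributes the old code gave it.
p = pickle.loads(old_pickle)
has_label = hasattr(p, "label")
print(p.x, p.y, has_label)  # 1 2 False
```

Any new method that reads self.label will fail with AttributeError on objects like p.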

Implicit

The great convenience of pickles is that they will serialize whatever structure your object has. There’s no extra work to create a serialization structure. But that brings problems of its own. Do you really want your datetimes serialized as datetimes? Or as iso8601 strings? You don’t have a choice: they will be datetimes.

Not only don’t you have to specify the serialization form, you can’t specify it.
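
As a rough illustration of the difference, compare pickle with a serializer where you choose the form yourself. JSON is used here only as one possible alternative, and the "when" key is made up:

```python
import datetime
import json
import pickle

when = datetime.datetime(2020, 6, 20, 12, 30)

# Pickle: no decision to make. A datetime goes in, a datetime comes out.
restored = pickle.loads(pickle.dumps(when))
print(type(restored).__name__)  # datetime

# JSON: you have to choose a form, such as an ISO 8601 string.
as_text = json.dumps({"when": when.isoformat()})
print(as_text)  # {"when": "2020-06-20T12:30:00"}
```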

Over-serializes

Pickles are implicit: they serialize everything in your objects, even data you didn’t want to serialize. For example, you might have an attribute that is a cache of computation that you don’t want serialized. Pickle doesn’t have a convenient way to skip that attribute.

Worse, if your object contains an attribute that can’t be pickled, like an open file object, pickle won’t skip it: it will insist on trying to pickle it, and then raise an exception.
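
Both behaviors show up in this sketch. The Report class and its attributes are hypothetical, and os.devnull is used only so the example has an open file to work with:

```python
import os
import pickle


class Report:
    """Hypothetical class with data we would rather not serialize."""

    def __init__(self):
        self.title = "Q2"
        self.cache = {"expensive": 42}  # derived data, not worth saving
        self.log = None


report = Report()

# The cache is serialized along with everything else.
restored = pickle.loads(pickle.dumps(report))
print(restored.cache)  # {'expensive': 42}

# An attribute that can't be pickled makes the whole object unpicklable.
with open(os.devnull, "w") as log:
    report.log = log
    try:
        pickle.dumps(report)
        error = None
    except TypeError as exc:
        error = exc

print(type(error).__name__)  # TypeError
```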

__init__ isn’t called

Pickles store the entire structure of your objects. When the pickle module recreates your objects, it does not call your __init__ method, since the object has already been created.

This can be surprising, since nowhere else do objects come into being without calling __init__. The logic here is that __init__ was already called when the object was first created in the process that made the pickle.

But your __init__ method might perform some essential work, like opening file objects. Your unpickled objects will be in a state that is inconsistent with your __init__ method. Or your __init__ might log information about the object being created. Unpickled objects won’t appear in the log.
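
A quick way to see this for yourself. The Widget class and the created list are made up, with the list standing in for a log or any other side effect of __init__:

```python
import pickle

created = []  # stand-in for a log of constructed objects


class Widget:
    """Hypothetical class whose __init__ has a side effect."""

    def __init__(self, name):
        self.name = name
        created.append(name)


original = Widget("first")
copy = pickle.loads(pickle.dumps(original))

# Two Widget objects now exist, but __init__ ran only once.
print(copy.name)  # first
print(created)    # ['first']
```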

Python only

Pickles are specific to Python, and are only usable by other Python programs. This isn’t strictly true: you can find packages for other languages that can use pickles, but they are rare. They will naturally be limited to the cross-language generic list/dict object structures, at which point you might as well just use JSON.

Unreadable

A pickle is a binary data stream (actually, instructions for an abstract execution engine). If you open a pickle as a plain file, you cannot read its contents. The only way to know what is in a pickle is to use the pickle module to load it. This can make debugging difficult, since you might not be able to search your pickle files for data you are interested in:

>>> pickle.dumps([123, 456])
b'\x80\x03]q\x00(K{M\xc8\x01e.'
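
If you do need to peek inside without loading, the standard library’s pickletools module can list the instructions in a pickle. This sketch assumes protocol 3, to match the bytes shown above; other protocols produce a slightly different instruction list:

```python
import pickle
import pickletools

data = pickle.dumps([123, 456], protocol=3)

# genops yields (opcode, argument, position) for each instruction
# without executing any of them.
ops = [(op.name, arg) for op, arg, pos in pickletools.genops(data)]
names = [name for name, arg in ops]
print(names)

# pickletools.dis prints a fuller, annotated listing of the same thing.
pickletools.dis(data)
```

This is still a long way from searching a text file, but it shows the 123 and 456 as arguments to the BININT1 and BININT2 instructions.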

Appears to pickle code

Functions and classes are first-class objects in Python: you can store them in lists, dicts, attributes, and so on. Pickle will gladly serialize objects that contain callables like functions and classes. But it doesn’t store the code in the pickle, just the name of the function or class.

Pickles are not a way to move or store code, though they can appear to be. When you unpickle your data, the names of the functions are used to find existing code in your running process.
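
Here is a sketch that shows the name being stored rather than the code. The greet function is made up, and redefining it stands in for loading the pickle in a process with different code:

```python
import pickle


def greet():
    return "hello"


data = pickle.dumps(greet)

name_in_pickle = b"greet" in data  # the function's name is stored
body_in_pickle = b"hello" in data  # its code and constants are not
print(name_in_pickle, body_in_pickle)  # True False


def greet():  # the code that exists at unpickling time
    return "goodbye"


restored = pickle.loads(data)
print(restored())  # goodbye
```

If no function named greet existed when unpickling, pickle.loads would fail instead of recreating it.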

Slow

Compared to other serialization techniques, pickle can be slow, as Ben Frederickson demonstrates in Don’t pickle your data.

But but..

Some of these problems can be addressed by adding special methods to your class, like __getstate__ or __reduce__. But once you start down that path, you might as well use another serialization method that doesn’t have these flaws to begin with.
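
For instance, this is roughly what skipping a cache attribute looks like with __getstate__ and __setstate__. The Report class is hypothetical, and this is one possible arrangement rather than the only way to write it:

```python
import pickle


class Report:
    """Hypothetical class that keeps its cache out of the pickle."""

    def __init__(self, title):
        self.title = title
        self.cache = {}

    def __getstate__(self):
        # Copy the attributes, then drop the one we don't want stored.
        state = self.__dict__.copy()
        del state["cache"]
        return state

    def __setstate__(self, state):
        # __init__ isn't called on unpickling, so rebuild the cache here.
        self.__dict__.update(state)
        self.cache = {}


report = Report("Q2")
report.cache["total"] = 42

restored = pickle.loads(pickle.dumps(report))
print(restored.title, restored.cache)  # Q2 {}
```

That is two extra methods to fix one flaw for one class, which is the kind of work pickle was supposed to save you.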

What’s better?

There are lots of other ways to serialize objects, ranging from plain-old JSON to fancier alternatives like marshmallow, cattrs, protocol buffers, and more.

I don’t have a strong recommendation for any one of these. The right answer will depend on the particulars of your problem. It might even be pickle...

Black lives matter

Saturday 13 June 2020

Since Trump’s “election” in 2016, politics have been overwhelming enough that it’s been difficult to catch my breath long enough to write anything about it. These last few weeks have only intensified that feeling, but have also demanded a response of some sort.

George Floyd’s killing was egregious enough to finally light a match to tinder that had been drying and accumulating for a long time. As difficult as it is to confront the gross injustices that run through our society, it is encouraging to see people come together to call it out and address it.

I can try to speak up in my small way. It would be easy to sit back and say I have not personally seen problems with police or in how society treats me. But that is not evidence that all is well. The difference between my experience and others’ is precisely the problem.

People with privilege, the people who can do something, are the people who don’t experience the problems. We have to listen to others, to people not like us. We have to face difficult truths about our place in society. It doesn’t mean we are bad people. It doesn’t mean we have sought to subjugate others. Privilege doesn’t mean we don’t have our own challenges and struggles. But we benefit where others do not, and we have to act.

Trump’s incompetence, disregard, corruption, and malice are on full display now, because of both COVID-19 and the Black Lives Matter movement. There are signs that this could be a significant turning point. But it will not be easy, and conflicts will get worse before they get better.

I’m looking for ways to help. I can donate money, though the current extensive energy on the left means the progressive organization landscape is cluttered and confusing. I wish there was more I could do. I am looking for ideas.

What I think is good and bad

Sunday 24 May 2020

I’m in the #python IRC channel on Freenode a lot. The people there are often quite opinionated. Julian had the idea of processing the logs to see what we thought was good, and what was bad, using sophisticated sentiment analysis.

Finding out what I liked and didn’t like wasn’t hard, since the “sophisticated sentiment analysis” was two regexes: “<nedbat>.* is good” and “<nedbat>.* is bad”!
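
A minimal sketch of that analysis might look like this. The log lines and their format are made up for illustration; only the two regexes come from the description above:

```python
import re

GOOD = re.compile(r"<nedbat>.* is good")
BAD = re.compile(r"<nedbat>.* is bad")

# Made-up sample lines, assuming a "<nick> message" log format.
log_lines = [
    "<nedbat> endswith is good",
    "<someone> tabs are bad",
    "<nedbat> eval(input()) is bad.",
    "<nedbat> what are you trying to do?",
]

good = [line for line in log_lines if GOOD.search(line)]
bad = [line for line in log_lines if BAD.search(line)]

print(good)  # ['<nedbat> endswith is good']
print(bad)   # ['<nedbat> eval(input()) is bad.']
```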

Without further commentary, here is a sampling of things that I said were bad:

  • vertical alignment is bad because it means you might have to change many lines just because one of them got wider.
  • eval(input()) is bad.
  • blindly following stuff is bad.
  • trolling is bad. Find a way to use your brains for good.
  • floats for currency is bad.
  • any class that you can only instantiate once is bad.
  • that sounds like a singleton, which is bad.
  • some people say, “you should start by learning assembler,” and i think that is bad advice.
  • del is fine. __del__ is bad
  • implicit copying is bad
  • texture (repetition you can see when you squint) is bad in code
  • aligning indents with the opening delimiter is bad.
  • this is bad: def main(nums=[1,2,3]).
  • monkeypatching is bad
  • it is bad to modify a list you are iterating.
  • __import__ is bad
  • import * is bad stuff
  • the python doc search is bad
  • and-or is bad
  • singletons are hidden global state, and global state is bad
  • checking types is bad
  • that “:type myparam:” syntax is bad, it’s not readable. use google style instead: https://www.sphinx-doc.org/en/master/usage/extensions/napoleon.html#google-vs-numpy
  • python is bad at recursion deeper than 1000
  • there is a package on pypi called time, which is bad.

And here are some things I said were good:

  • “python -m app_name” is good, but recent.
  • endswith is good
  • xpath is good at selecting nodes in an xml dom tree.
  • recursion is good for recursive structures. Iteration is good for iterative structures
  • madlibs is good because you can do string manipulation with puerile humor
  • learning is good! :)
  • python is good for full applications.
  • re.sub is good.
  • mock(spec=thing) is good
  • excitement is good :)
  • pip will know how to install into virtualenvs, which is good.
  • gist.github.com is good
  • argparse is good for simple things
  • it is good to cover your tests, though others disagree with me
  • textwrap.dedent is good
  • requirements.txt is good for recreating environments.
  • the csv module is good at writing out dicts as rows.
  • this is good for seeing how rst will be formatted: http://rst.ninjs.org/
  • colorama is good
  • yaml is good
  • setUp is good. tearDown is better done as addCleanup
  • yield from is good for when you want one generator to produce all the values from another generator.
  • duck typing is good when there’s an operation supported across a number of types, and you can just use the operation without worrying about the type.
  • learning is good! :)
  • Django is good if you like having lots of things handled for you. Flask is good if you like to put together all the pieces yourself.
  • tox is good for testing against multiple environments
  • for validating email addresses, this is good: [^@ ]+@[^@ ]+\.[^@ ]+ https://nedbatchelder.com/blog/200908/humane_email_validation.html
  • the python.org tutorial is good if you have programmed in other languages before.
  • “pip install -e .” is good
  • the interactive interpreter is good for experimentation, but isn’t good for real development.
  • bpaste.net is good
  • Think Python is good
  • atexit is good.
  • python is good, and we are helpful and friendly :)
  • pandas is good if you need to manipulate tables of data. If you don’t, then don’t use pandas
  • you want to do something for each thing in a list, that’s what a for loop is good for :)
  • there’s an old habit of using “:type:blah:” or whatever, which is horrible. Sphinx now supports the “napoleon” style natively, which is good: http://www.sphinx-doc.org/en/1.4.8/ext/napoleon.html
  • pytest does cool assert rewriting, which 99.9999% of the time is good magic.
  • pudb is good in the terminal
  • trying to be efficient is good.
  • sql is good for some kinds of data. nosql is good for others.
  • .format is good
  • a decorator is good for wrapping functions in new functionality.
  • the pytest -k option is good at that.
  • i would not try to jam everything into setup.py. this feels like something a makefile is good at.
  • pytest is good at parameterized tests.
  • “if not list:” is good python
  • @classmethod is good for alternate constructors, yes.
  • rg is good: https://github.com/BurntSushi/ripgrep
  • the prompt is good for doing small experiments. Once you have larger programs, put them in .py files, and run them: python myprog.py
  • low tech is good tech
  • Learning is good.
  • numpy is good when you can do whole-matrix operations at once. If you need to iterate over elements and do individual operations, it doesn’t provide any benefit.
  • any is good when you have an iterable of true/false
  • choice is good. Why should there be only one implementation?
  • gist.github.com is good, or paste.pound-python.org
  • click is good
  • you’re not using a shell, which is good.
  • whatever helps you learn is good
  • numpy is good when you are doing matrix and array operations. lists are good for ordered collections of things
  • obfuscation isn’t something Python is good at.
  • the ast module is good for one thing: representing python programs as a tree of nodes. It provides tools for parsing source text into that tree.
  • python is good as a first language
  • isolation is good, but doing it with mocks can be a problem in itself
  • subclassing is good for when SubClass, by its essence is a ParentClass.
  • talking is good :)
  • coverage.py is good.
  • recursion is good for recursive structures (trees). iteration is better for linear structures (lists)
  • Jupyter is good for visualizations, graphing, tables, etc. interactive experimentation
  • lxml.html is good
  • sha256 is good too
  • Fluent Python is good, if you like books
  • for speed, PyPy is good. Or Cython. if you want to write C code, you can use cffi to call it from Python
  • curiosity is good
  • collections.Counter is good at counting things, and would do this in O(N).
  • .encode makes the conversion explicit, which is good. my_bytes = my_unicode.encode(“utf8”)
  • they is good.
  • pudb is good
  • in python 3, super() is good, but it doesn’t work in python 2.
  • the -k option is good for this, or you can define markers.
  • virtualenv is good for separating different projects’ needs
  • tig is good too
  • rst is good at multi-page docs, without assuming it will be html. markdown just shrugs and says, “use html when you have to”
  • writing is good just for its own sake.
  • bpaste.net is good.
  • learning is good
  • dependencies are good. using other people’s solutions to your problems is good.
  • split has a much better PR agency, but partition is good too
  • attribute access is good.
  • i won’t say that loops should introduce scopes, but it is good to be able to understand the interplay between scoping and closur-ing
  • https://pypi.org/project/appdirs/ is good for answering that question
  • automate the boring stuff is good. What kind of software will you be writing?
  • iso8601 is good
  • prompt_toolkit is good
  • numpy is good when you have an array full of data, and you can do one operation that works on all of it at once

Not much

Sunday 3 May 2020

My son Nat is 30 and has autism. His expressive language is somewhat limited, and he relies on rote answers when he can. A few years ago, one of his caregivers taught him that if someone asks, “What’s up?” you can answer, “Not much.”

At first I thought, “that’s not always a good answer,” but then I started paying attention to how people around me responded, and sure enough, not only is it a good answer, but it’s almost always the answer people give.

It’s especially appropriate these days. Nat has been living with us through these COVID-19 times since the middle of March, and he is frustrated at how limited his days have become.

One of Nat’s favorite things is a three-week calendar showing what’s going to happen. It was a regular routine when he would visit on weekends from his group home: we’d sit down and update his calendar. Every day would be marked with where he’d wake up, what he would do during the day, and where he would go to bed. Unusual activities and special events in particular would be noted and recited. He would often sit and study the calendar, or would ask to review it with us.

When he first moved in for the lock-down, we tried updating his calendar, but the result was too depressingly accurate: every day was the same, and every day was at home. The only special event was Passover, which had been changed from in-person to Zoom:

A Zoom seder with 19 people on 8 screens

Soon the weekly calendar was abandoned as uninteresting, and Nat started saying, “April,” meaning, I want it to be April when this will be over and I can go back to my regular life.

Of course, April came without a let-up of the lockdown, so he started saying, “May.” Now that we are in May, he says, “Summer.” We can’t give him a definite answer. The best we can do is to remind him that we have to wait, and that everyone he knows is also at home waiting.

I have been walking with Nat, a long-favored activity of ours. We’re up to about five miles a day, which is good for both of us.

My wife Susan has been handling most of the weekday activities other than the walks, trying to find things for Nat to do. They have been doing a lot of Facebook, a baking project most days, a little street basketball, and chores around the house.

She has taken to calling this “Suzie’s Little Day Program”:

Suzie’s Little Day Program pros and cons:

Pros: 1) Great staff-to-client ratio; 2) Lots of love; 3) Lots of napping; 4) Great treats; 5) Strong exercise component; 6) No ABA Whatsoever.

Cons: 1) Over-reliance on sugar; 2) Over-napping; 3) Not enough variety in peer group; 4) Moody staff; 5) Unreliable hours; 6) No ABA Whatsoever; 7) Often boring AF.

Overall, Nat is taking this very well. He has settled into this underwhelming routine. He dutifully wears his face mask on walks, and now knows to walk far around other people on the sidewalk without me prompting him. He likes getting on Zoom calls with the groups he is part of (MUSE foundation, Special Olympics, his day program), even if it’s just to watch because jumping in is difficult in those chaotic events.

We have gotten used to having him around full-time. There are 12 nearly empty bottles of shampoo in the shower (hard to explain). A few favorite Disney movies are on tight rotation. He keeps us a little more regimented, since ad-hoc is not his style.

I keep a close eye on him. He will not let us know if he starts to feel sick, so we have to be alert for him. He can be very passive, so it’s easy to feel guilty if he is staring into space or napping too much. We feel like we should be filling his time somehow, but it’s not possible to keep him busy ten hours a day.

Luckily, he has been in a calm period overall. There were other times in his life when these days would have been much stormier. We hope that his even temper continues. It’s been seven weeks so far, and we don’t know how much longer we are going to be together like this.

This is just our life right now. We’re all doing what we can. What’s going on? I’d have to say, not much.
