How long did it take you to learn Python?

Friday 27 March 2020

Wait, don’t answer that. It doesn’t matter.

Beginners seem to ask this question when they are feeling daunted by the challenge before them. Maybe they are hoping for a helpful answer, but it seems like most answers will just be a jumping-off point for feeling bad about their own progress.

Everyone learns differently. They learn from different sources, at different paces. Suppose you ask this question and someone answers “one month”? Will you feel bad about yourself because you’ve been at it for six weeks? Suppose they say, “ten years”? Now what do you think?

The question doesn’t even make sense in a way. What do we mean by “learn”? If you can write a number guessing game in Python, have you learned Python? Are we talking about basic familiarity, or deep memorization? Does something have to be second nature, or is it OK if you are still looking through the docs for details? “Learned” is not a binary state. There isn’t a moment where you don’t know Python, and then suddenly you do.

And what do we even mean by “Python”? Are we talking about the basic syntax, or do you need to be able to write a metaclass, a descriptor, and a decorator with arguments? Is it just the language, or also the standard library? How many of the 200+ modules in the standard library do you need to be familiar with? What about commonly used third-party libraries? Are we also including the skills needed to write large (10k lines) programs in Python? “Python” is a large and varied landscape, and you will be finding out new things about it for years and years.

Especially since it keeps changing! Python isn’t sitting still, so you will never be done “learning Python.” I have been using Python for more than 20 years, and been deeply involved with it for at least half that time. I thought I knew Python well, then they added “async”. I will have to figure that out one of these days...

Since Python is used in many different domains, the things you need to learn could be completely different from someone else’s. These days, lots of people are learning Python to get into data science. I don’t do data science. Here are more things I don’t know (taken from a random sampling of “libraries you should know” blog posts): TensorFlow, Scikit-Learn, Numpy, Keras, PyTorch, SciPy, Pandas, Matplotlib, Theano, NLTK, etc. How should I compare my learning to a data scientist’s?

My advice to beginners is: don’t compare your learning to other people’s. Everyone learns differently, using different materials, at different speeds. Everyone has different definitions of “learn,” and of “Python.” Understand your goals and your learning style. Find materials that work for you. Study, and learn in your own way. You can do it.

Functional strategies in Python

Friday 13 March 2020

I got into a debate about Python’s support for functional programming (FP) with a friend. One of the challenging parts was listening to him say, “Python is broken” a number of times.

Python is not broken. It’s just not a great language for writing pure functional programs. Python seemed broken to my friend in exactly the same way that a hammer seems broken to someone trying to turn a screw with it.

I understand his frustration. Once you have fully embraced the FP mindset, it is difficult to understand why people would write programs any other way.

I have not fully embraced the FP mindset. But that doesn’t mean that I can’t apply some FP lessons to my Python programs.

In discussions about how FP and Python relate, I think too much attention is paid to the tactics. For example, some people say, “no need for map/filter/lambda, use list comprehensions.” Not only does this put off FP people because they’re being told to abandon the tools they are used to, but it gives the impression that list comprehensions are somehow at odds with FP constructs, or are exact replacements.

Rather than focus on the tactics, the important ideas to take from FP are strategies, including:

  • Write small functions with no side-effects
  • Don’t change existing data, make new data
  • Combine functions to make larger functions

These strategies all lead to modularized code, free from mysterious action at a distance. The code is easier to reason about, debug, and extend.
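Here is a minimal sketch of those three strategies in plain Python. The function names and the sample data are invented for illustration, and `compose` is a hand-rolled helper, not something from the standard library:

```python
from functools import reduce


def strip_blanks(lines):
    """Return a new list without blank lines; the input list is not modified."""
    return [line for line in lines if line.strip()]


def upper_all(lines):
    """Return a new list with every line upper-cased."""
    return [line.upper() for line in lines]


def compose(*funcs):
    """Combine functions, applied left to right, into one larger function."""
    return lambda value: reduce(lambda acc, func: func(acc), funcs, value)


# A larger function built out of two small, side-effect-free ones.
clean_up = compose(strip_blanks, upper_all)

original = ["spam", "", "eggs"]
result = clean_up(original)
```

Each small function can be tested on its own, and `original` is the same after the call as before it, so nothing else in the program can be surprised by it changing.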

Of course, languages that are built from the ground up with these ideas in mind will have great tools to support them. They have tactics like:

  • Immutable data structures
  • Rich libraries of higher-order functions
  • Good support for recursion

Functional languages like Scheme, Clojure, Haskell, and Scala have these tools built-in. They are of course going to be way better for writing Functional programs than Python is.

FP people look at Python, see none of these tools, and conclude that Python can’t be used for functional programming. As I said before, Python is not a great language for writing purely functional programs. But it’s not a lost cause.

Even without those FP tools in Python, we can keep the FP strategies in mind. Although list comprehensions are presented as the alternative to FP tools, they help with the FP strategies, because they force you to make new data instead of mutating existing data.
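A small illustration, with made-up data: the comprehension and the map/filter/lambda spelling both build a new list and leave the original alone, which is the strategy that matters.

```python
prices = [5, 12, 8, 20]

# List comprehension: filter and transform into a brand-new list.
doubled_big = [p * 2 for p in prices if p > 6]

# The map/filter/lambda spelling of the same idea, also producing new data.
doubled_big_fp = list(map(lambda p: p * 2, filter(lambda p: p > 6, prices)))

# Both approaches leave `prices` exactly as it was.
unchanged = prices == [5, 12, 8, 20]
```

Which spelling you prefer is a tactic. Not mutating `prices` is the strategy.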

If other FP professionals are like my friend, they are probably saying to themselves, “Ned, you just don’t get it.” Perhaps that is true, how would I know? That doesn’t mean I can’t improve my Python programs by thinking Functionally. I’m only just dipping my toes in the water so far, but I want to do more.

For more thoughts about this:

  • Gary Bernhardt: Boundaries, a PyCon talk that discusses Functional Core and Imperative Shell.
  • If you want more Functional tools, there are third-party Python packages like:
    • pyrsistent, providing immutable data structures
    • pydash, providing functional tools
    • fnc, providing functional tools

Getting Started Testing with pytest

Thursday 20 February 2020

Next week I am presenting Getting Started Testing: pytest edition at Boston Python (event page).

This talk has been through a few iterations. In 2011, I gave a presentation at Boston Python about Getting Started Testing, based on the standard library unittest module. In 2014, I updated it and presented it at PyCon. Now I’ve updated it again, and will be presenting it at Boston Python.

The latest edition, Getting Started Testing: pytest edition, uses pytest throughout. It’s a little long for one evening of talking, but I really wanted to cover the material in it. I wanted to touch on not just the mechanics of testing, but the philosophy and central challenges as well.

I’m sure there are important things I left out, and probably digressions I could trim, but it’ll do. Thoughts welcome.

Re-using my presentations

Thursday 13 February 2020

Yesterday I got an email saying that someone in Turkey had stolen one of my presentations. The email included a YouTube link. The video showed a meetup. The presenter (I’ll call him Samuel) was standing in front of a title slide in my style that said, “Big-O: How Code Slows as Data Grows,” which is the title of my PyCon 2018 talk.

The video was in Turkish, so I couldn’t tell exactly what Samuel was saying, but I scrolled through the video, and sure enough, it was my entire talk, complete with illustrations by my son Ben.

Looking closer, the title slide had been modified:

My title slide, with someone else's name

(I’ve blurred Samuel’s specifics in this image, and Samuel is not his actual name. This post isn’t about Samuel, and I’m not interested in directing any more negative attention to him.)

Scrolling to the end of the talk, my last slide, which repeated my name and contact details, was gone. In its place was a slide promoting other videos featuring Samuel or his firm.

I felt like I had been forcibly elbowed off the stage, and Samuel was taking my place while trying to minimize my contributions.

In 2018, I did two things for this presentation: I wrote it, and I presented it at PyCon 2018. By far the most work was in the writing. It takes months of thinking, writing, designing, and honing to make a good presentation. In fact, of the two types of work, Samuel valued the writing most, since that is the part he kept. The reason this presentation attracted his attention, and why he wanted to present it himself, was because of its content.

“Originally presented by” is hardly the way to credit the author of a presentation, especially in small type while removing his name and leaving only a GitHub handle.

So I tweeted,

This is my talk from PyCon 2018, in its entirety, with my name nearly removed. It’s theft. I was not asked, and did not give permission.

Samuel apologized and took down the video. There were other tweets claiming that this was a pattern of Samuel’s, and that perhaps the apology would not be followed by changed behavior. But again, this post isn’t about Samuel.

This whole event got me thinking about people re-using my presentations.

I enjoy writing presentations. I like thinking about how to explain things. People have liked the explanations I’ve written. I like that they like them enough to want to show them to people.

But I’ve never thought much about how I would answer if someone asked me if they could present one of my talks. If people can use my talks to help strengthen their local community and up-skill their members, I want them to be able to. I am not interested in people using my talks to unfairly promote themselves.

I’m not sure re-using someone else’s presentation is a good idea. Wouldn’t it be better to write your own talk based on what you learned from someone else’s? But if people want to re-use a talk, I’d like to have an answer.

So here are my first-cut guidelines for re-using one of my talks:

  1. Ask me if you can use a talk. If I say no, then you can’t.
  2. Don’t change the main title slide. I wrote the presentation, my name should be on it. If you were lecturing about a novel, you wouldn’t hand out copies of the book with your name in place of the author’s.
  3. Make clear during the presentation that I was the author and first presenter. A way to do that would be to include a slide about that first event, with links, and maybe even a screenshot of me from the video recording of the first event.
  4. I include a short-link and my Twitter handle in the footer of my slides. Leave these in place. We live in a social online world. I want to benefit from the connections that might arise from one of my presentations.
  5. Keep my name and contact details prominent in the end slide.
  6. If your video is posted online, include my name and the point about this being a re-run in the first paragraph of the description.

It would be great if my talks could get a broader reach than I can make happen all by myself. To be honest, I’m still not sure if it’s a good idea to present someone else’s talk, but it’s better to do it this way than the way that just happened.

sys.getsizeof is not what you want

Sunday 9 February 2020

This week at work, an engineer mentioned that they were looking at the sizes of data returned by an API, and it was always coming out the same, which seemed strange. It turned out the data was a dict, and they were looking at the size with sys.getsizeof.

Sounds great! sys.getsizeof has an appealing name, and the description in the docs seems really good:

Return the size of an object in bytes. The object can be any type of object. All built-in objects will return correct results [...]

But the fact is, sys.getsizeof is almost never what you want, for two reasons: it doesn’t count all the bytes, and it counts the wrong bytes.

The docs go on to say:

Only the memory consumption directly attributed to the object is accounted for, not the memory consumption of objects it refers to.

This is why it doesn’t count all the bytes. In the case of a dictionary, “objects it refers to” includes all of the keys and values. getsizeof is only reporting on the memory occupied by the internal table the dict uses to track all the keys and values, not the size of the keys and values themselves. In other words, it tells you about the internal bookkeeping, and not any of your actual data!

The reason my co-worker’s API responses were all the same size was that they were dictionaries with the same number of keys, and getsizeof was ignoring all the keys and values when reporting the size:

>>> import sys
>>> d1 = {"a": "a", "b": "b", "c": "c"}
>>> d2 = {"a": "a"*100_000, "b": "b"*100_000, "c": "c"*100_000}
>>> sys.getsizeof(d1)
>>> sys.getsizeof(d2)

If you wanted to know how large all the keys and values were, you could sum their lengths:

>>> def key_value_length(d):
...     klen = sum(len(k) for k in d.keys())
...     vlen = sum(len(v) for v in d.values())
...     return klen + vlen
...
>>> key_value_length(d1)
6
>>> key_value_length(d2)
300003

You might ask, why is getsizeof like this? Wouldn’t it be more useful if it gave you the size of the whole dictionary, including its contents? Well, it’s not so simple. Data in memory can be shared:

>>> x100k = "x" * 100_000
>>> d3 = {"a": x100k, "b": x100k, "c": x100k}
>>> key_value_length(d3)
300003

Here there are three values, each 100k characters, but in fact, they are all the same value, actually the same object in memory. That 100k string only exists once. Is the “complete” size of the dict 300k? Or only 100k?

It depends on why you are asking about the size. Our d3 dict is only about 100k bytes in RAM, but if we try to write it out, it will probably be about 300k bytes.
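To make the two answers concrete, here is a rough sketch. `deep_getsizeof` is a hypothetical helper, not part of the standard library, and it only knows about a few built-in containers. It walks into the dict and counts each object once, by id, so shared data is not double-counted. JSON stands in for "writing it out":

```python
import json
import sys


def deep_getsizeof(obj, seen=None):
    """Sum sys.getsizeof over obj and its contents, counting each object once."""
    if seen is None:
        seen = set()
    if id(obj) in seen:
        return 0  # shared object, already counted
    seen.add(id(obj))
    size = sys.getsizeof(obj)
    if isinstance(obj, dict):
        for key, value in obj.items():
            size += deep_getsizeof(key, seen) + deep_getsizeof(value, seen)
    elif isinstance(obj, (list, tuple, set, frozenset)):
        for item in obj:
            size += deep_getsizeof(item, seen)
    return size


x100k = "x" * 100_000
d3 = {"a": x100k, "b": x100k, "c": x100k}

# In memory, the shared 100k string is counted once.
in_memory = deep_getsizeof(d3)

# Written out, the string appears three times.
written_out = len(json.dumps(d3).encode("utf-8"))
```

On CPython, `in_memory` comes out a little over 100k and `written_out` a little over 300k. Neither is "the" size. They answer different questions.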

And sys.getsizeof also reports on the wrong bytes:

>>> sys.getsizeof(1)
28
>>> sys.getsizeof("a")
50

Huh? How can a small integer be 28 bytes? And the one-character string “a” is 50 bytes!? It’s because Python objects have internal bookkeeping, like links to their type, and reference counts for managing memory. That extra bookkeeping is overhead per-object, and sys.getsizeof includes that overhead.

Because sys.getsizeof reports on internal details, it can be baffling:

>>> sys.getsizeof("a")
50
>>> sys.getsizeof("ab")
51
>>> sys.getsizeof("abc")
52
>>> sys.getsizeof("á")
74
>>> sys.getsizeof("áb")
75
>>> sys.getsizeof("ábc")
76
>>> face = "\N{GRINNING FACE}"
>>> len(face)
1
>>> sys.getsizeof(face)
80
>>> sys.getsizeof(face + "b")
84
>>> sys.getsizeof(face + "bc")
88

With an ASCII string, we start at 50 bytes, and need one more byte for each ASCII character. With an accented character, we start at 74, but still only need one more byte for each ASCII character. With an exotic Unicode character (expressed here with the little-used \N Unicode name escape), we start at 80, and then need four bytes for each ASCII character we add! Why? Because Python has a complex internal representation for strings. I don’t know why those numbers are the way they are. PEP 393 has the details if you are curious. The point here is: sys.getsizeof is almost certainly not the thing you want.
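The per-character part of the pattern, at least, follows the rule PEP 393 describes: CPython stores the whole string using 1, 2, or 4 bytes per character, depending on the widest character in it. The helper below (`bytes_per_char`, a made-up name) only mirrors that rule. It does not inspect the real object, and it says nothing about the fixed overhead:

```python
def bytes_per_char(s):
    """Bytes per character CPython would pick for s under the PEP 393 rule."""
    widest = max(map(ord, s), default=0)
    if widest < 0x100:
        return 1  # ASCII and Latin-1 characters
    if widest < 0x10000:
        return 2  # the rest of the Basic Multilingual Plane
    return 4  # everything beyond it, such as emoji


samples = ["abc", "ábc", "\N{GRINNING FACE}bc"]
widths = [bytes_per_char(s) for s in samples]
```

One wide character makes every character in the string wide, which is why adding a plain “b” to the grinning face costs four bytes.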

The “size” of a thing depends on how the thing is being represented. The in-memory Python data structures are one representation. When the data is serialized to JSON, that will be another representation, with completely different reasons for the size it becomes.

In my co-worker’s case, the real question was, how many bytes will this be when written as CSV? The sum-of-len method would be much closer to the right answer than sys.getsizeof. But even sum-of-len might not be good enough, depending on how accurate the answer has to be. Quoting rules and punctuation overhead change the exact length. It might be that the only way to get an accurate enough answer is to serialize to CSV and check the actual result.
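A sketch of that last approach, using the standard library csv module with its default dialect. The rows are invented sample data, and `csv_byte_length` is a hypothetical helper name:

```python
import csv
import io


def csv_byte_length(rows, encoding="utf-8"):
    """Serialize rows with csv.writer and return the encoded length in bytes."""
    buffer = io.StringIO(newline="")
    csv.writer(buffer).writerows(rows)
    return len(buffer.getvalue().encode(encoding))


rows = [["name", "note"], ["Ned", 'says "hi", twice']]

# The quick estimate: just add up the lengths of the cells.
sum_of_len = sum(len(cell) for row in rows for cell in row)

# The real answer: commas, line endings, and quoting are all included.
actual = csv_byte_length(rows)
```

Here `actual` is larger than `sum_of_len`, because of the separators, the line endings, and the quoting that the comma and quote characters in the second row force.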

So: know what question you are really asking, and choose the right tool for the job. sys.getsizeof is almost never the right tool.

Color palette tools

Wednesday 15 January 2020

Two useful sites for choosing color palettes, both from map-making backgrounds. They both consider qualitative, sequential, and diverging palettes as different needs, which I found insightful.

  • Paul Tol’s notes, which give special consideration to color-blindness. He has some visual demonstrations that picked up my own slight color-blindness.
  • Cynthia Brewer’s ColorBrewer, with interactive elements so you can create your own palette for your particular needs.

Color Palette Ideas is different: its palettes are based on photographs, but it can also be a good source of ideas.

As an update to my ancient blog post about this same topic, Adobe Color and paletton both have tools for generating palettes in lots of over-my-head ways. And Color Synth Axis is still very appealing to the geek in me, though it needs Flash, and so I fear is not long for this world...