Re-using my presentations

Thursday 13 February 2020

Yesterday I got an email saying that someone in Turkey had stolen one of my presentations. The email included a YouTube link. The video showed a meetup. The presenter (I’ll call him Samuel) was standing in front of a title slide in my style that said, “Big-O: How Code Slows as Data Grows,” which is the title of my PyCon 2018 talk.

The video was in Turkish, so I couldn’t tell exactly what Samuel was saying, but I scrolled through the video, and sure enough, it was my entire talk, complete with illustrations by my son Ben.

Looking closer, the title slide had been modified:

My title slide, with someone else's name

(I’ve blurred Samuel’s specifics in this image, and Samuel is not his actual name. This post isn’t about Samuel, and I’m not interested in directing any more negative attention to him.)

Scrolling to the end of the talk, my last slide, which repeated my name and contact details, was gone. In its place was a slide promoting other videos featuring Samuel or his firm.

I felt like I had been forcibly elbowed off the stage, and Samuel was taking my place while trying to minimize my contributions.

In 2018, I did two things for this presentation: I wrote it, and I presented it at PyCon 2018. By far the most work was in the writing. It takes months of thinking, writing, designing, and honing to make a good presentation. In fact, of the two types of work, Samuel valued the writing most, since that is the part he kept. The reason this presentation attracted his attention, and why he wanted to present it himself, was because of its content.

“Originally presented by” is hardly the way to credit the author of a presentation, especially in small type while removing his name and leaving only a GitHub handle.

So I tweeted,

This is my talk from PyCon 2018, in its entirety, with my name nearly removed. It’s theft. I was not asked, and did not give permission.

Samuel apologized and took down the video. There were other tweets claiming that this was a pattern of Samuel’s, and that perhaps the apology would not be followed by changed behavior. But again, this post isn’t about Samuel.

This whole event got me thinking about people re-using my presentations.

I enjoy writing presentations. I like thinking about how to explain things. People have liked the explanations I’ve written. I like that they like them enough to want to show them to people.

But I’ve never thought much about how I would answer if someone asked me if they could present one of my talks. If people can use my talks to help strengthen their local community and up-skill their members, I want them to be able to. I am not interested in people using my talks to unfairly promote themselves.

I’m not sure re-using someone else’s presentation is a good idea. Wouldn’t it be better to write your own talk based on what you learned from someone else’s? But if people want to re-use a talk, I’d like to have an answer.

So here are my first-cut guidelines for re-using one of my talks:

  1. Ask me if you can use a talk. If I say no, then you can’t.
  2. Don’t change the main title slide. I wrote the presentation, my name should be on it. If you were lecturing about a novel, you wouldn’t hand out copies of the book with your name in place of the author’s.
  3. Make clear during the presentation that I was the author and first presenter. A way to do that would be to include a slide about that first event, with links, and maybe even a screenshot of me from the video recording of the first event.
  4. I include a bit.ly short-link and my Twitter handle in the footer of my slides. Leave these in place. We live in a social online world. I want to benefit from the connections that might arise from one of my presentations.
  5. Keep my name and contact details prominent in the end slide.
  6. If your video is posted online, include my name and the point about this being a re-run in the first paragraph of the description.

It would be great if my talks could get a broader reach than I can make happen all by myself. To be honest, I’m still not sure if it’s a good idea to present someone else’s talk, but it’s better to do it this way than the way that just happened.

sys.getsizeof is not what you want

Sunday 9 February 2020

This week at work, an engineer mentioned that they were looking at the sizes of data returned by an API, and the sizes always came out the same, which seemed strange. It turned out the data was a dict, and they were looking at the size with sys.getsizeof.

Sounds great! sys.getsizeof has an appealing name, and the description in the docs seems really good:

sys.getsizeof(object)
Return the size of an object in bytes. The object can be any type of object. All built-in objects will return correct results [...]

But the fact is, sys.getsizeof is almost never what you want, for two reasons: it doesn’t count all the bytes, and it counts the wrong bytes.

The docs go on to say:

Only the memory consumption directly attributed to the object is accounted for, not the memory consumption of objects it refers to.

This is why it doesn’t count all the bytes. In the case of a dictionary, “objects it refers to” includes all of the keys and values. getsizeof is only reporting on the memory occupied by the internal table the dict uses to track all the keys and values, not the size of the keys and values themselves. In other words, it tells you about the internal bookkeeping, and not any of your actual data!

The reason my co-worker’s API responses were all the same size was that they were dictionaries with the same number of keys, and getsizeof was ignoring all the keys and values when reporting the size:

>>> import sys
>>> d1 = {"a": "a", "b": "b", "c": "c"}
>>> d2 = {"a": "a"*100_000, "b": "b"*100_000, "c": "c"*100_000}
>>> sys.getsizeof(d1)
232
>>> sys.getsizeof(d2)
232

If you wanted to know how large all the keys and values were, you could sum their lengths:

>>> def key_value_length(d):
...     klen = sum(len(k) for k in d.keys())
...     vlen = sum(len(v) for v in d.values())
...     return klen + vlen
...
>>> key_value_length(d1)
6
>>> key_value_length(d2)
300003

You might ask, why is getsizeof like this? Wouldn’t it be more useful if it gave you the size of the whole dictionary, including its contents? Well, it’s not so simple. Data in memory can be shared:

>>> x100k = "x" * 100_000
>>> d3 = {"a": x100k, "b": x100k, "c": x100k}
>>> key_value_length(d3)
300003

Here there are three values, each 100k characters, but they are all the same value; in fact, they are the same object in memory. That 100k string only exists once. Is the “complete” size of the dict 300k? Or only 100k?

It depends on why you are asking about the size. Our d3 dict is only about 100k bytes in RAM, but if we try to write it out, it will probably be about 300k bytes.
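
For example, if you did want an in-memory total that accounts for sharing, one approach is to walk the object graph and count each distinct object only once, keyed by id(). This is a rough sketch, not a complete solution: it only descends into dicts, lists, tuples, and sets, the name deep_sizeof is mine, and the exact numbers vary by Python version:

>>> def deep_sizeof(obj, seen=None):
...     # Total the bytes of obj and everything it refers to,
...     # counting each distinct object only once.
...     if seen is None:
...         seen = set()
...     if id(obj) in seen:
...         return 0
...     seen.add(id(obj))
...     size = sys.getsizeof(obj)
...     if isinstance(obj, dict):
...         size += sum(deep_sizeof(k, seen) + deep_sizeof(v, seen)
...                     for k, v in obj.items())
...     elif isinstance(obj, (list, tuple, set, frozenset)):
...         size += sum(deep_sizeof(item, seen) for item in obj)
...     return size
...
>>> deep_sizeof(d3)
100431
>>> deep_sizeof(d2)
300529

Now d3 comes out near 100k and d2 near 300k: the shared string is counted once for d3, but d2’s three distinct strings are each counted.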

And sys.getsizeof also reports on the wrong bytes:

>>> sys.getsizeof(1)
28
>>> sys.getsizeof("a")
50

Huh? How can a small integer be 28 bytes? And the one-character string “a” is 50 bytes!? It’s because Python objects have internal bookkeeping, like links to their type and reference counts for managing memory. That bookkeeping is per-object overhead, and sys.getsizeof includes it.

Because sys.getsizeof reports on internal details, it can be baffling:

>>> sys.getsizeof("a")
50
>>> sys.getsizeof("ab")
51
>>> sys.getsizeof("abc")
52
>>> sys.getsizeof("á")
74
>>> sys.getsizeof("áb")
75
>>> sys.getsizeof("ábc")
76
>>> face = "\N{GRINNING FACE}"
>>> len(face)
1
>>> sys.getsizeof(face)
80
>>> sys.getsizeof(face + "b")
84
>>> sys.getsizeof(face + "bc")
88

With an ASCII string, we start at 50 bytes, and need one more byte for each ASCII character. With an accented character, we start at 74, but still only need one more byte for each ASCII character. With an exotic Unicode character (expressed here with the little-used \N Unicode name escape), we start at 80, and then need four bytes for each ASCII character we add! Why? Because Python has a complex internal representation for strings. I don’t know why those numbers are the way they are. PEP 393 has the details if you are curious. The point here is: sys.getsizeof is almost certainly not the thing you want.

The “size” of a thing depends on how the thing is being represented. The in-memory Python data structures are one representation. When the data is serialized to JSON, that will be another representation, with completely different reasons for the size it becomes.

In my co-worker’s case, the real question was, how many bytes will this be when written as CSV? The sum-of-len method would be much closer to the right answer than sys.getsizeof. But even sum-of-len might not be good enough, depending on how accurate the answer has to be. Quoting rules and punctuation overhead change the exact length. It might be that the only way to get an accurate enough answer is to serialize to CSV and check the actual result.
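
For instance, here is a minimal sketch of that last approach (the name csv_size is mine; it assumes the data can be expressed as rows, such as d2.items()):

>>> import csv, io
>>> def csv_size(rows):
...     # Serialize the rows as CSV and measure the actual result,
...     # including delimiters, quoting, and line endings.
...     buf = io.StringIO()
...     csv.writer(buf).writerows(rows)
...     return len(buf.getvalue().encode("utf-8"))
...
>>> csv_size(d2.items())
300012

The 300,012 here is the 300,003 characters of data, plus a comma and a “\r\n” line ending for each of the three rows.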

So: know what question you are really asking, and choose the right tool for the job. sys.getsizeof is almost never the right tool.

Color palette tools

Wednesday 15 January 2020

Two useful sites for choosing color palettes, both from map-making backgrounds. They both consider qualitative, sequential, and diverging palettes as different needs, which I found insightful.

  • Paul Tol’s notes, which gives special consideration to color-blindness. He has some visual demonstrations that picked up my own slight color-blindness.
  • Cynthia Brewer’s ColorBrewer, with interactive elements so you can create your own palette for your particular needs.

Color Palette Ideas is different: its palettes are based on photographs, but it can also be a good source of ideas.

As an update to my ancient blog post about this same topic, Adobe Color and paletton both have tools for generating palettes in lots of over-my-head ways. And Color Synth Axis is still very appealing to the geek in me, though it needs Flash, and so I fear is not long for this world...

Bug #915: solved!

Monday 13 January 2020

Yesterday I pleaded, Bug #915: please help! It got posted to Hacker News, where Robert Xiao (nneonneo) did some impressive debugging and found the answer.

The user’s code used mocks to simulate an OSError when trying to make temporary files (source):

with patch('tempfile._TemporaryFileWrapper') as mock_ntf:
    mock_ntf.side_effect = OSError()

Inside tempfile.NamedTemporaryFile, the error handling misses the possibility that _TemporaryFileWrapper will fail (source):

(fd, name) = _mkstemp_inner(dir, prefix, suffix, flags, output_type)
try:
    file = _io.open(fd, mode, buffering=buffering,
                    newline=newline, encoding=encoding, errors=errors)

    return _TemporaryFileWrapper(file, name, delete)
except BaseException:
    _os.unlink(name)
    _os.close(fd)
    raise

If _TemporaryFileWrapper fails, the file descriptor fd is closed, but the file object referencing it still exists. Eventually, it will be garbage collected, and the file descriptor it references will be closed again.

But file descriptors are just small integers which will be reused. The failure in bug 915 is that the file descriptor did get reused, by SQLite. When the garbage collector eventually reclaimed the file object leaked by NamedTemporaryFile, it closed a file descriptor that SQLite was using. Boom.
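
The hazard is easy to show in miniature. This is a toy illustration of the double-close on a typical POSIX system, not the actual bug:

import os

fd = os.open("first.txt", os.O_RDWR | os.O_CREAT)
f = os.fdopen(fd)      # a file object wrapping the descriptor
os.close(fd)           # the descriptor is closed once...
fd2 = os.open("second.txt", os.O_RDWR | os.O_CREAT)
assert fd2 == fd       # ...the small integer is reused...
f.close()              # ...and closed again, silently closing
                       # second.txt out from under its owner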

There are two improvements to be made here. First, the user code should be mocking public functions, not internal details of the Python stdlib. In fact, the variable is already named mock_ntf as if it had been a mock of NamedTemporaryFile at some point.

NamedTemporaryFile would be a better mock because that is the function being used by the user’s code. Mocking _TemporaryFileWrapper is relying on an internal detail of the standard library.
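
A sketch of that better mock, assuming the code under test calls tempfile.NamedTemporaryFile by that name (the right patch target depends on how the function is imported):

from unittest.mock import patch

# Patch the public API the code actually calls, not a stdlib internal.
with patch('tempfile.NamedTemporaryFile') as mock_ntf:
    mock_ntf.side_effect = OSError()
    ...  # exercise the code that creates the temporary file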

The other improvement is to close the leak in NamedTemporaryFile. That request is now bpo39318. As it happens, the leak had also been reported as bpo21058 and bpo26385.
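
One possible shape for the fix (a sketch of the idea, not the actual patch attached to bpo39318): once _io.open succeeds, the file object owns the descriptor, so a failure in _TemporaryFileWrapper should close the file, which closes fd exactly once:

(fd, name) = _mkstemp_inner(dir, prefix, suffix, flags, output_type)
try:
    file = _io.open(fd, mode, buffering=buffering,
                    newline=newline, encoding=encoding, errors=errors)
except BaseException:
    # _io.open failed, so fd is still ours to close.
    _os.unlink(name)
    _os.close(fd)
    raise
try:
    return _TemporaryFileWrapper(file, name, delete)
except BaseException:
    # Closing the file closes fd too, exactly once.
    file.close()
    _os.unlink(name)
    raise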

Lessons learned:

  • Hacker News can be helpful, in spite of the tangents about shell redirection, authorship attribution, and GitHub monoculture.
  • There are always people more skilled at debugging. I had no idea you could script gdb.
  • Error handling is hard to get right. Edge cases can be really subtle. Bugs can linger for years.

I named Robert Xiao at the top, but lots of people chipped in to help get to the bottom of this. ikanobori posted it to Hacker News in the first place. Chris Caron reported the original #915 and stuck with the process as it dragged on. Thanks everybody.

Bug #915: please help!

Sunday 12 January 2020

Updated: this was solved on Hacker News. Details in Bug #915: solved!

I just released coverage.py 5.0.3, with two bug fixes. There was another bug I really wanted to fix, but it has stumped me. I’m hoping someone can figure it out.

Bug #915 describes a disk I/O failure. Thanks to some help from Travis support, Chris Caron has provided instructions for reproducing it in Docker, and they work: I can generate disk I/O errors at will. What I can’t figure out is what coverage.py is doing wrong that causes the errors.

To reproduce it, start a Travis-based docker image:

cid=$(docker run -dti --privileged=true --entrypoint=/sbin/init \
    -v /sys/fs/cgroup:/sys/fs/cgroup:ro \
    travisci/ci-sardonyx:packer-1542104228-d128723)
docker exec -it $cid /bin/bash

Then in the container, run these commands:

su - travis
git clone --branch=nedbat/debug-915 https://github.com/nedbat/apprise-api.git
cd apprise-api
source ~/virtualenv/python3.6/bin/activate
pip install tox
tox -e bad,good

This will run two tox environments, called good and bad. The bad environment will fail with a disk I/O error; the good one will succeed. The difference is that bad uses the pytest-cov plugin and good does not. Two detailed debug logs will be created: debug-good.txt and debug-bad.txt. They show what operations were executed in the SqliteDb class in coverage.py.

The Big Questions: Why does bad fail? What is it doing at the SQLite level that causes the failure? And most importantly, what can I change in coverage.py to prevent the failure?

Some observations and questions:

  • If I change the last line of the steps to “tox -e good,bad” (that is, run the environments in the other order), then the error doesn’t happen. I don’t understand why that would make a difference.
  • I’ve tried adding time.sleep() calls to try to slow the pace of database access, but maybe in not enough places? And if this fixes it, what’s the right way to productize that change?
  • I’ve tried using the detailed debug log to create a small Python program that in theory accesses the SQLite database in exactly the same way, but I haven’t managed to create the error that way. What aspect of access am I overlooking?
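
That last experiment, replaying the logged operations, looks roughly like this. A hypothetical sketch: it assumes each log line holds one SQL statement, which may not match the real log format:

import sqlite3

def replay_log(log_path, db_path):
    # Feed the recorded SQL statements back into a fresh database,
    # one at a time, in the order they were logged.
    con = sqlite3.connect(db_path)
    with open(log_path) as log:
        for line in log:
            stmt = line.strip()
            if stmt:
                con.execute(stmt)
    con.commit()
    con.close()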

If you come up with answers to any of these questions, I will reward you somehow. I am also eager to chat if that would help you solve the mysteries. I can be reached on email, Twitter, as nedbat on IRC, or in Slack. Please get in touch if you have any ideas. Thanks.
