|Ned Batchelder : Blog | Code | Text | Site|
I am on the plane back to Boston from PyCon 2015 in Montreal. You've probably read over and over again that PyCon is the best conference ever, yadda-yadda. I haven't been to another conference in a long time, so I don't have points of comparison. I can tell you that PyCon feels like a huge family reunion.
I started on Thursday, and was not feeling part of things. I don't know why. I thought perhaps 9 PyCons in a row is too many. I thought maybe I should be spending my energies elsewhere.
But Friday, I started the day by helping with the keynotes, keeping time, tracking down speakers, and so on. I felt involved. I was helping friends with things they needed to do.
PyCon is almost entirely organized and run by volunteers. There is one employee, all the rest is done by people just helping as a side project. I think this gives the event a tone of something you do, rather than something you attend or consume. Anyone can volunteer to make things happen, and it can be a really good way to meet people.
There are 2500 people at PyCon, but we are all in the same group. There isn't a entire cadre of paid staff on one side, and attendees on the other. We're all making the conference happen in our own ways. It an open-source conference in the truest sense of the word.
My co-worker Adam Palay gave his talk early on Friday. I'd first seen Adam speak in a lightning talk at Boston Python. His girlfriend Anne was there to record him. They seemed supportive and close. I really liked the talk he gave, and told him so. When the call for talks opened for PyCon, he let me know he was submitting a proposal, and I helped him where I could.
His talk was accepted, along with mine and two other speakers from edX. For each talk, we had a rehearsal at work, and at a Boston Python rehearsal night. Each time Adam rehearsed his talk, his girlfriend Anne and his brother Josh were there. I was impressed by their support. It turned out Anne was going to not only come to Montreal, but attend the conference with him.
Friday morning at PyCon, I went to Adam's talk. Sitting in the second row was Anne. Next to her was Josh. Next to him was Adam's sister, and on either side were his mother and father, all with conference badges! I joked about "Team Palay", and that the five of them should have held up cards spelling P-A-L-A-Y.
Clearly, this level of support from a family is unusual, to take the time, buy airfare and hotel, and pay the conference fees, just to see Adam present his 30-minute talk at a technical conference.
I'm explaining all this about Adam's supportive family because when I am at PyCon, I feel a bit like Adam must all the time. I am surrounded by friends who feel like family. We are brought together by an odd esoteric shared interest, but we come together each year, and interact online throughout the year. We are together to talk about technical topics, but it goes beyond that.
I know this must sound like a greeting card or something. Don't get me wrong: like any family, there is friction. I don't like everyone in the Python world. But so many people at PyCon know each other and have built relationships over years, there are plenty of friendly faces all around.
All those friendly faces give rise to an effect my devops guy Feanil coined "Ned latency": the extra time I have to figure in when planning to be at a certain place at a certain time. When traveling over any significant distance at PyCon, there will be people I want to stop and talk to.
This is called the "hallway track": the social or technical activity that happens in the hallways all during the day, regardless of the track talks. I've spoken to people at PyCon who've said, "I haven't seen any talks!"
Last year during lunch, I happened to sit next to a woman I didn't know. We introduced ourselves. Her name was Jenny. We chatted a bit, and then headed off our own activities. Over the next few days, I'd wave to Jenny as we passed each other on the escalators, and so on.
I saw Jenny again this year and miraculously remembered her name, so I waved and said, "Hi Jenny." This happened a few times. Later in the weekend, Jenny came up to me and said, "I want to thank you, you really made me feel welcome."
This made me really happy. I was saying hi to Jenny originally so that I would know more people, but we'd made a tiny connection that helped her in some way, and she felt strongly enough about it to tell me. Ian describes a similar dynamic from the bag-stuffing evening: just learning another person's name gives you a connection to that person that can last a surprisingly long time.
There are people I greet at PyCon purely because I've been chatting with them for five minutes once a year at every PyCon I've been to.
One of the highlights of PyCon for me is giving talks. I've spoken at the last 7 PyCons (the talks are on my text page). I put a lot of work into the talks, and am proud that they have some lasting power as things people recommend to other learners. After a talk, people always ask, "how did it go?" My answer is usually, "people seemed to like it," but the other half is, "on the inside, horrible. I know all the things I wish I had done differently!"
On Sunday evening, Shauna Gordon-McKeon and Open Hatch organized an intro to sprinting session for new contributors. I agreed to be a mentor there, thinking it would be a classroom style lecture, with mentors milling around helping people one-on-one. Turned out it was a series of 15-minute lectures at a number of stations around the room, with people shuttling between topics they wanted to hear about. I was the speaker on unit testing.
I was able to start by saying, if you really want to know about this, see my PyCon talk from last year, Getting Started Testing. Then I launched into an impromptu 15-minute overview of unit testing.
During one of the breaks, on my way to the water fountain, I passed a woman in the hallway watching the talk on her headphones. She said it was great, then later on Twitter, we had a typical PyCon love-fest.
To be able to see someone learning from something you've created is very gratifying and rewarding.
I attended one day of sprints. My main project there was Open edX, but I also said I would be sprinting on coverage.py, which I had never done before. I'd always had the feeling that coverage.py was esoteric and thorny, and it would be difficult to get contributors going. I was pleasantly surprised that five people joined me to make some headway against issues in the tracker.
But some of the interesting bugs are about branch coverage, which I had become somewhat frustrated by. I warned people that the problems might require a complete rewrite, but they were game to look into it.
Mickie Betz in particular was digging into an issue involving loops in generators. I was interested to watch her progress, and helped her with debugging techniques, but was not hopeful that there was a practical fix. To my surprise, a day later, she has submitted a pull request with a very simple solution. Mickie has restored my faith in humanity. She persevered in the face of a discouraging maintainer (me), and proved him wrong!
Another sprinter, Jon Chappell, picked up an issue that was easy but annoying to fix. Annoying because it was asking coverage.py to accommodate a stupid limitation in a different tool. It was not glamourous work, but I really appreciated him taking the task so that I didn't have to do it.
Two other sprinters, Conrad Ho and Leonardo Pistone, have each submitted a pull request, and Leonardo is also chasing down other issues. Lastly, Frederick Wagner has expressed interest in adding a warning-suppressing feature.
A very productive time, considering I was only at the sprints for about four hours. PyCon is amazing.
One thing I've never seen at PyCon is organized juggling. I considered bringing beanbags with me this time, but thought they would be heavy to carry around. Then Yelp was handing out bouncy balls at their booth, so I got four of those, and used them all weekend. It was a good way to play with people, especially once we did some pair juggling. Next year, I'll bring some serious equipment, and have a real open space (or two!) Who's in?
I don't know why I felt off the first day. PyCon is an amazing time, and now I again can't imagine missing it. It connects you to people. One afternoon, an attendee pulled me aside to show me a bug in coverage.py. I looked in the issue tracker, and saw that it had been written up four years ago by Christian Heimes, who was attending PyCon this year for the first time, and who I met at the bar on my first night!
PyCon energizes me, and cements my relationship to the entire Python world. Sometimes I wonder about a programming language as the basis for a group of people, but why not? They share my sensibilities and interests. They like what I do, and I like what they do. We move in similar circles. Do you need better reasons for a group of 2500 people to be close friends?
I gave my talk yesterday at PyCon 2015: Python Names and Values. PyCon has always been good at getting videos online, but they just keep getting better: the video was online the same day.
People ask me afterwards how the talk went. I got good reactions, but I also know what I would like to have done differently. I think I spoke too fast, and I think I should have had more practical advice about not mutating values if you can avoid it.
At least I didn't swear on stage this time...
My youngest son Ben turned 17 today. He is fascinated with mushrooms, so we made him a mushroom cake. Actually a trio of cakes:
It looks a bit like cupcakes, but no cupcakes were harmed in the making of this cake.
The main mushroom has a stem ("It's called a stipe, Dad") made of two 4.5-inch cake rounds. The cap ("pileus, Dad") was baked in the bottom of a stainless steel mixing bowl. The two stem pieces bulged more than we expected, so we sliced them off and made caps for the medium mushrooms. They are supported by stacked Ring-Dings for the stem.
The dots are mega M&M's. The tiny mushrooms are mini-marshmallows supporting white chocolate Reese's peanut butter cups. Gummi worms add character.
A cut-away view of the medium mushroom:
One of the things that is very useful about Python is its extreme introspectability and malleability. Taken too far, it can make your code an unmaintainable mess, but it can be very handy when trying to debug large and complex projects.
Open edX is one such project. Its main repository has about 200,000 lines of Python spread across 1500 files. The test suite has 8000 tests.
I noticed that running the test suite left a number of temporary directories behind in /tmp. They all had names like tmp_dwqP1Y, made by the tempfile module in the standard library. Our tests have many calls to mkdtemp, which requires the caller to delete the directory when done. Clearly, some of these cleanups were not happening.
To find the misbehaved code, I could grep through the code for calls to mkdtemp, and then reason through which of those calls eventually deleted the file, and which did not. That sounded tedious, so instead I took the fun route: an aggressive monkeypatch to find the litterbugs for me.
My first thought was to monkeypatch mkdtemp itself. But most uses of the function in our code look like this:
Because the function was imported directly, if my monkeypatching code ran after this import, the call wouldn't be patched. (BTW, this is one more small reason to prefer importing modules, and using module.function in the code.)
Looking at the implementation of mkdtemp, it makes use of a helper function in the tempfile module, _get_candidate_names. This helper is a generator that produces those typical random tempfile names. If I monkeypatched that internal function, then all callers would use my code regardless of how they had imported the public function. Monkeypatching the internal helper had the extra advantage that using any of the public functions in tempfile would call that helper, and get my changes.
To find the problem code, I would put information about the caller into the name of the temporary file. Then each temp file left behind would be a pointer of sorts to the code that created it. So I wrote my own _get_candidate_names like this:
This code uses inspect.stack to get the call stack. We slice it oddly, to get the closest three calling frames in the right order. Then we extract the filenames from the frames, strip off the ".py", and concatenate them together along with the line number. This gives us a string that indicates the caller.
The real _get_candidate_names function is used to get a generator of good random names, and we add our stack inspection onto the name, and yield it.
Then we can monkeypatch our function into tempfile. Now as long as this module gets imported before any temporary files are created, the files will have names like this:
The first shows that the file was created in test_import_export.py at line 289, called from case.py line 78, from case.py line 53. The second shows that test_video.py has a few functions calling eventually into tempfile.py.
I would be very reluctant to monkeypatch private functions inside other modules for production code. But as a quick debugging trick, it works great.
Coverage.py has a trace function written in C, for speed. It uses the Python C API, which is notoriously tricky to get right because you have to manage reference counts yourself.
I've made some significant changes to the trace function recently, to add plugin support to the C tracer. Adding tests for badly behaved plugins, I managed to crash Python. Not a traceback, a for-real crash in CPython.
Naturally, this means something is wrong in my C extension. Poring over the code, I couldn't see anything amiss. I'd long been intrigued by the idea of David Malcolm's CPyTracer, a plugin to gcc that performs static path analysis to find mistakes in Python C extensions, so I decided to give it a try.
The best instructions are on A. Jesse Jiryu Davis' blog: Analyzing Python C Extensions With CPyChecker. I installed Fedora as suggested, and got the compiler running without much trouble (I just typed "yum" every time I wanted to type "apt-get").
The simple way to run the checker worked fine:
This generates very nice HTML reports (like this) in two different styles that walk you through a path through your code that leads to a bad outcome. Well, supposedly a bad outcome. I found as Jesse did that there are false positives.
With the default settings, the checker only considers 256 paths through a function then stops, to avoid combinatorial explosions. But my functions had many more paths than that.
I increased the memory size of my Fedora Vagrantfile, then told CPyChecker to push on to examine a quarter million paths:
This found a few issues, but did not resolve the crash I'm experiencing. Next step: rebuild CPython --with-pydebug.
BTW, Stefan Behnel has rewritten my extension in Cython, and I really should seriously consider switching over, so that this kind of thing doesn't happen any more.
For my Mom's 75th birthday (tomorrow), we made a laptop cake, because she is as much tied to her computer as I am:
Plain yellow cake, cut in half, with the "screen" propped up with Yodels. The keyboard is bite-sized Snickers. The trackpad is a Petite Ecolier cookie, upside-down with its signature chocolate layer hidden in the cake. The screen has Skittles for the window controls, and various candies for the icons in the dock. A swirled candy is standing in as the spinning beach ball!
If you have not seen our cakes before, they are not poised or perfect. We have a great time making them, and do not worry too much about technical accuracy. We don't use professional materials like fondant. We use stuff you can get in the supermarket, and the candy aisle is an important stop. They are fun, and they taste great!
Lots of things happening in coverage.py world these days. Turns out I broke the XML report a long time ago, so that directories were not reported as packages. I honestly don't know why I let that sit for so long. It's fixed now, but I feel bad that I've ignored people's bug reports and pull requests. I'll try to be more responsive.
The fix is in coverage.py v4.0a3. Also, the reports now use file names instead of a weird hybrid. Previously, the file "a/b/c.py" was reported as "a/b/c". Now it is shown as "a/b/c.py". This works better where non-Python files can be reported, so we can't assume the extension is .py.
Oh, did I mention that now you can coverage-measure your Django templates?
Also in the XML report, there's now a configuration setting to determine the directory depth that will be reported as packages. The default is that all directories will be reported as packages, but if you set the depth to 2, then only the upper two layers of directories will be reported.
Try coverage.py v4.0a3.
New programmers often need small projects to work on as they hone their skills. Exercises in courses are too small, and don't leave much room for self-direction or extending to follow the interests of the student. "Real" projects, either in the open-source world or at work, tend to be overwhelming and come with real-world constraints that prevent experimentation.
Kindling projects are meant to fill this gap: simple enough that a new learner can take them on, but with possibilities for extension and creativity. Large enough that there isn't one right answer, but designed to be hacked on by a learner simply to flex their muscles.
To help people find projects like these, I've made a page: Kindling Projects. I'm hoping it will grow over time and be useful to people.
If you have suggestions, send them in.
A long experiment has come to fruition: coverage.py support for Django templates. I've added plugin support to coverage.py, and implemented a plugin for Django templates. If you want to try it in its current alpha state, read on.
The plugin itself is pip installable:
To run it, add these settings to your .coveragerc:
Then run your tests under coverage.py. It requires coverage.py >= 4.0a2, so it may not work with other coverage-related tools such as test-runner coverage plugins, or coveralls.io. The plugin works on Django >= 1.4, and Python 2 or 3.
You will see your templates listed in your coverage report alongside your Python modules. They have a .html extension but no directory, that's still to be fixed.
The technique used to measure the coverage is the same that Dmitry Trofimov used in dtcov, but integrated into coverage.py as a plugin, and made more performant. I'd love to see how well it works in a real production project. If you want to help me with it, feel free to drop me an email.
The coverage.py plugin mechanism is designed to be generally useful for hooking into the collection and reporting phases of coverage.py, specifically to support non-Python files. I've also got a plugin for Mako templates, but it needs some fixes from Mako. If you have non-Python files you'd like to support in coverage.py, let's talk.
The recent holidays gave us Christmas and New Year's Day on a Thursday, so I also had two Fridays off. This gave me two four-day weekends in a row. At the same time, I got a pull request against Cog from Doug Hellmann. Together, these gave me the time and the reason to update Cog.
So I cleaned up the couple of old pull requests, and open issues, and modernized the repo quite a bit.
Cog 2.4 is available now, with three new features:
It was nice to revisit this old friend, and be able to tend it and ship it.
Across more than 30 repos, we have more than 9500 pull requests. To get detailed information about all of them would require at least 9500 requests to the GitHub API. But GitHub rate-limits API use, so I can only make 5000 requests in an hour, so I can't collect details across all of our pull requests.
Most of those pull requests are old, and closed. They haven't changed in a long time. GitHub supports ETags, and any request that responds with 304 Not Modified isn't counted against your rate limit. So I should be able to use ETags to mostly get cached information, and still be able to get details for all of my pull requests.
I ran my program with this, and it didn't seem to help: I was still running out of requests against the API. Doing a lot of debugging, I figured out why. The reason is instructive for API design.
When you ask the GitHub API for details of a pull request, you get a JSON response that looks like this (many details omitted, see the GitHub API docs for the complete response):
GitHub has done a common thing with their REST API: they include details of related objects. So this pull request response also includes details of the users involved, and the repos involved, and the repos include details of their users, and so on.
The ETag for a response fingerprints the entire response. That means that if any data in the response changes, the ETag will change, which means that the cached copy will be ignored and the full response will be returned.
Look again at the repo information included: open_issues_count changes every time an issue is opened or closed. A pull request is a kind of issue, so that happens a lot. There's also pushed_at and updated_at, which will change frequently.
So when I'm getting details about a pull request that has been closed and dormant for (let's say) a year, the ETag will still change many times a day, because of other irrelevant activity in the repo. I didn't need those repo details on the pull request in the first place, but I always thought it was just harmless bulk. Nope, it's actively hurting my ability to use the API effectively.
Some REST API's give you control over the fields returned, or related objects included in responses, but GitHub's does not. I don't know how to use the GitHub API the way I wanted to.
So the pull request response has lots of details I don't need (the repo's owner's avatar URL?), and omit plenty of details I'm likely to need, like commits, comments, and so on. I understand, they aren't including one-to-many information at all, but I'd rather see the one-to-many than the almost certainly useless one-to-one information that is included, and is making automatic caching impossible.
Luckily, my co-worker David Baumgold had a good idea and the energy to implement it: webhookdb replicates GitHub data to a relational database, using webhooks to keep the two in sync. It works great: now I can make queries against Postgres to get details of pull requests! No rate limiting, and I can use SQL if it's a better way to express my questions.
One of the challenging things about programming is being able to really see code the way the computer is going to see it. Sometimes the human-only signals are so strong, we can't ignore them. This is one of the reasons I like indentation-significant languages like Python: people attend to the indentation whether the computer does or not, so you might as well have the people and computers looking at the same thing.
I was reminded of this problem yesterday while trying to debug a sample application I was toying with. It has a config file with some strings and dicts in it. It reads in part like this:
When I saw this file, I thought, "That's a weird way to comment things," but didn't worry more about it. Then later when the response was failing, I debugged into it, and realized what was wrong with this file. Before reading on, do you see what it is?
• • •
• • •
• • •
Python concatenates adjacent string literals. This is handy for making long strings without having to worry about backslashes. In real code, this feature is little-used, and it happens in a surprising place here. The "docstring" for the dictionary is implicitly concatenated to the first key. PYLTI_URL_FIX has a key that's 163 characters long: " Remap URL to ... URL.\nhttps://localhost:8000/", including three newlines.
But SECRET_KEY isn't affected. Why? Because the SECRET_KEY assignment line is a complete statement all by itself, so it doesn't continue onto the next line. Its "docstring" is a statement all by itself. The PYLTI_URL_FIX docstring is inside the braces of the dictionary, so it's all part of one 13-line statement. All the tokens are considered together, and the adjacent strings are concatenated.
As odd as this code was, it was still hard to see what was going to happen, because the first string was clearly meant as a comment, both in its token form (a multiline string, starting in the first column) and in its content (English text explaining the dictionary). The second string is clearly intended as a key in the dict (short, containing data, indented). But all of those signals are human signals, not computer signals. So I as a human attended to them and misunderstood what would happen when the computer saw the same text and ignored those signals.
The fix of course is to use conventional comments. Programming is hard, yo. Stick to the conventions.