|Ned Batchelder : Blog | Code | Text | Site|
At edX, I help with the Open edX community, which includes being a traffic cop with the flow of pull requests. We have 15 or so different repos that make up the entire platform, so it's tricky to get a picture of what's happening where.
So I made a chart:
The various teams internal to edX are responsible for reviewing pull requests in their areas of expertise, so this chart is organized by teams, with most-loaded at the top. The colors indicate the time since the pull request was opened. The bars are clickable, showing details of the pull requests in each bunch.
This was a fun project because of the new stuff I got to play with along the way. The pull request data is gathered by a Python program running on Heroku, using the GitHub API of course. A summary of the relevant pull requests is stored in a JSON file. A GitHub webhook pings Heroku when a pull request changes, and the Python program updates the JSON.
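Here's a minimal sketch of what the summarizing step could look like. It is not the actual program: the `team` and `created_at` fields, the team names, the age thresholds, and the sample date are all made up for illustration, and real GitHub API responses would need to be parsed into this shape first.

```python
import json
from collections import defaultdict
from datetime import datetime, timedelta, timezone

# Hypothetical age buckets as (upper bound in days, label). The thresholds
# behind the real chart's colors are not given in the post.
AGE_BUCKETS = [(7, "week"), (21, "three_weeks"), (None, "older")]


def age_bucket(created_at, now):
    """Return the label of the first bucket the pull request's age fits in."""
    days = (now - created_at).days
    for limit, label in AGE_BUCKETS:
        if limit is None or days < limit:
            return label


def summarize(pulls, now):
    """Count open pull requests per team and age bucket, busiest team first."""
    summary = defaultdict(lambda: defaultdict(int))
    for pull in pulls:
        summary[pull["team"]][age_bucket(pull["created_at"], now)] += 1
    ordered = sorted(summary.items(), key=lambda item: -sum(item[1].values()))
    return [{"team": team, "counts": dict(counts)} for team, counts in ordered]


# Made-up sample data standing in for what the GitHub API would provide.
now = datetime(2014, 5, 1, tzinfo=timezone.utc)
pulls = [
    {"team": "lms", "created_at": now - timedelta(days=2)},
    {"team": "lms", "created_at": now - timedelta(days=30)},
    {"team": "studio", "created_at": now - timedelta(days=10)},
]
summary_json = json.dumps(summarize(pulls, now), sort_keys=True)
```

The resulting JSON is the kind of pre-digested file a page could fetch and turn into bars, one per team, with one segment per age bucket.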
Then I used d3.js in the HTML page to retrieve the JSON, slice and dice it, and build an SVG chart. The clickable bars open to show HTML tables embedded with a foreignObject. This was complicated to get right, but drawing the tables with SVG would be painful, and drawing the bars with HTML would be painful. This let me use the best tool for each job.
D3.js is an impressive piece of work, but took some getting used to. Mike Bostock's writings helped explain what was going on. The key insight: d3 is not a charting library. It's a way to use data to create pages, a way of turning data into DOM nodes.
So far, the chart seems to have helped edX stay aware of how pull requests are moving. It hasn't made everything speedy, but at least we know where things are stalled, and it has encouraged teams to try to avoid being at the top. I'd like to add more to it, for example, other ways of sorting and grouping, and more information about the pull requests themselves.
As the maintainer of coverage.py, it's always been intriguing that web applications have so much code in template files. Coverage.py measures Python execution, so the logic in the template files goes un-measured.
Recently I started experimenting with measuring templates as well as pure Python code. Mako templates compile to Python files, which are then executed. Coverage.py can see the execution in the compiled Python files, so once we have a way to back-map the lines from the Mako output back to the Mako template, we have the start of a usable Mako coverage measurement.
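To make the back-mapping idea concrete, here is a small sketch. The line map is invented for the example, and the rule "use the closest mapped line at or before the executed line" is an assumption, not necessarily how Mako or coverage.py does it.

```python
import bisect

# Hypothetical line map: compiled-Python line -> template line.
# A real compiled template would supply its own mapping; these are made up.
LINE_MAP = {10: 1, 14: 2, 15: 2, 21: 5}


def template_lines(executed_py_lines, line_map):
    """Translate executed lines of the compiled file into template lines."""
    starts = sorted(line_map)
    covered = set()
    for py_line in executed_py_lines:
        # Assumption: a compiled line belongs to the closest mapped line
        # at or before it. Lines before the first mapped line are ignored.
        idx = bisect.bisect_right(starts, py_line) - 1
        if idx >= 0:
            covered.add(line_map[starts[idx]])
    return covered


covered = template_lines({10, 16, 3}, LINE_MAP)
```

Here compiled lines 10 and 16 map back to template lines 1 and 2, while line 3 is boilerplate ahead of any template content and is dropped.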
This Mako experiment is on the tip of the coverage.py repo, and requires some code on the tip of Mako. The code isn't right yet, but it shows the idea. Eventually, this should be a plugin to coverage.py provided by Mako, but for now, we're just trying to prove out the concept.
If you want to try the Mako coverage (please do!), configure Mako to put your compiled .py files someplace convenient (like a mako/ directory in your project), then set this environment variable:
Jinja also compiles templates to Python files, but Django does not. Django is very popular, so I would like to support those templates also. Dmitry Trofimov wrote dtcov to measure Django template coverage. He does a tricky thing: in the trace function, determine if you are inside the Django template engine, and if so, walk the stack and look at the locals to grab line numbers.
As written dtcov looks too compute-intensive to run on full-sized projects, but I think the idea could work. I'm planning to experiment with it this week.
I had coffee the other day with Nathan Kohn. He goes by the nickname en_zyme, and it's easy to see why. He relishes the role of bringing pairs of people together to see what kind of new reaction can result.
This time, it was to meet Jonathan Henner, a doctoral student of his at Boston University. The topic was how to include deaf people in the Python community.
The discussion was wide-ranging, and I'm sure I've forgotten interesting tangents, but I got this jumble of notes:
Accommodating the deaf at Python community gatherings is a challenge because it means getting either an ASL interpreter, or a CART provider to close-caption presentations live. This presents a few hurdles:
Programming is a good career for the deaf, since it is heavily textual, but they may have a hard time accessing the curriculum for it. Jonathan is exploring the possibility of creating classes in ASL, since that is many deaf people's first language. A common misconception is that ASL is simply English spoken with the hands, but it is not.
We talked a bit about the overlap between the deaf and autistic worlds. The Walden school near Boston specializes in deaf students with other mental or emotional impairments, including autism. Jonathan made a claim that made me think: that deafness and autism are the two disabilities that have their own sub-culture. I don't know if that is true, I'm sure people with other disabilities will disagree, but it's interesting to discuss.
There were a lot of avenues to explore, and I'm not sure what will come of it all. It would be great to broaden Python's reach into another community of people who haven't had full access to tech.
Has anyone had any experience doing this? Thoughts?
I continue to notice an unsettling trend: the rise of the GitHub monoculture. More and more, people seem to believe that GitHub is the center of the programming universe.
Don't get me wrong, I love GitHub. It succeeded at capturing and promoting the social aspect of development better than any other site. And git, despite its flaws, is a great version control system.
And just to be clear, I am not talking about the recent turmoil about GitHub's internal culture. That's a problem, but not the one I'm talking about.
Someone said to me, "I couldn't find coverage.py on GitHub." Right, because it's hosted on Bitbucket. When a developer thinks, "I want to find the source for package XYZ," why do they go to the GitHub search bar instead of Google? Do people really believe so strongly that GitHub is the only place for code that it has supplanted Google as the way to find things?
(Yes, Google has a monopoly on search. But searching with Google today does not lock me in to continuing to search with Google tomorrow. When a new search engine appears, I can switch with no downside whatsoever.)
Another example: I'm contributing a chapter to the 500 lines book (irony: the link is to GitHub). Here in the README, to summarize authors, we are asked to provide a GitHub username and a Twitter handle. I suggested that a homepage URL is a more powerful and flexible way for authors to invite the curious to learn more about them. This suggestion was readily adopted (in a pending pull request), but the fact that the first thing to come to mind was GitHub+Twitter is another sign of people's mindset that these sites are the only places, not just some places.
Don't get me started on the irony of shops whose workflow is interrupted when GitHub is down. Git is a distributed version control system, right?
Some people go so far as to say, as Brandon Weiss has, GitHub is your resume. I would hope they do not mean it literally, but instead as a shorthand for, "your public code will be more useful to potential employers than your list of previous jobs." But reading Brandon's post, he means it literally, going so far as to recommend that you carefully garden your public repos to be sure that only the best work is visible. So much for collaboration.
There is power in everyone using the same tools. GitHub succeeds because it makes it simple for code to flow from developer to developer, and for people to find each other and work together. Still, other tools do some things better. Gerrit is a better code review workflow. Mercurial is easier for people to get started with.
GitHub has done a good job providing an API that makes it possible for other tools to integrate with them. But if Travis only works with GitHub, that just reinforces the monoculture. Eventually someone will have a better idea than GitHub, or even git. But the more everyone believes that GitHub is the only game in town, the higher the barrier will be to adopting the next great idea.
I love git and GitHub, but they should be a choice, not the only choice.
PyCon 2014 is over, and as usual, I loved every minute. There are a huge number of people that I know there, and about 5 different sub-communities that I feel an irrationally strong attachment to.
My head is still spinning from the high-energy four days I've had, and I'm sure I'm leaving out an important high point. I just love every minute!
On the downside, I did not see as much of Montreal as I would have liked, but we'll be back for PyCon 2015, so I have a second chance!
My youngest son Ben turns 16 in a few days, and a few days ago was accepted into the RISD summer program for high-schoolers! So today, his cake:
He's really excited about RISD. It will be a big transition for him, six weeks away from home, doing serious art instruction. I'm really proud of him, and eager to see what changes it will bring. I'm also nervous about that...
The cake was fun, it's not often you get to try your hand at a calligraphic challenge in frosting!
Happy Pi day! Celebrate with delicious circular pastries!
In an ongoing (and wildly premature!) thread on the Python-Dev mailing list, people are debating the possible shapes of Python 4, and Barry pointed out that Guido doesn't like two-digit minor versions.
I can understand that, version numbers look like floats, and 3.1 is the same as 3.10. But in this case I think we should forge on to 3.10, 3.11, etc. Partly to avoid the inevitable panic that a switch to 4.x will entail, no matter what the actual semantics. But mostly so that we can get to version 3.14, which will of course be known as PiPy, joining in merry celebration with friends PyPI and PyPy!
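The float confusion is easy to demonstrate. In this small sketch, `parse_version` is a helper written just for the example, not a standard function:

```python
def parse_version(text):
    """Split a dotted version string into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


# Read as floats, "3.10" and "3.1" are the same number.
same_as_floats = float("3.10") == float("3.1")

# Compared as tuples of ints, 3.10 comes after 3.9, as intended.
newer_as_tuples = parse_version("3.10") > parse_version("3.9")
```

Comparing the parts as integers keeps 3.9, 3.10, 3.11 in the right order, which is all a two-digit minor version needs.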
A few months ago, I rejiggered a bit of my home page and sidebar, and I added a link to accept Bitcoin. (You know, in case you wanted to throw some my way!)
I couldn't find a simple description of how to create a link that would let people send me Bitcoin, but I was able to create a link at Coinbase:
It seems to work, but is there a more Bitcoin-native way to do it? This links to a particular checkout at Coinbase, meaning I've set a particular amount. Coinbase also creates new addresses for every transaction, which confuses me.
Is there a link that will let me accept Bitcoin better than what Coinbase does?
I was just reminded of these two tweets, which do a great job capturing some non-technical aspects of being a developer.
Jamie Forrest said:
Elliot Loh said:
How true, how true!
When I wrote Facts and myths about Python names and values, I included figures drawn with Graphviz. I made a light wrapper to help, but the figures were mostly just Graphviz. For example, this code:
produced this figure:
The beauty of Graphviz is that you describe the topology of your graph, and it lays it out for you. The problem is, if you care about the layout, it is very difficult to control. Why are the 1, 2, 3 where they are? How do I move them? I couldn't figure it out. And I wanted to have new shapes, and control where the lines joined, etc.
So I wrote a library called Cupid to help me make the figures the way I wanted them. Cupid is an API that generates SVG figures. Half of it is generic "place this rectangle here" kinds of SVG operations, and half is very specific to the names-and-values diagrams I wanted to create.
Now to make that same figure, I use this code:
The new code is more complicated than the old code, but I can predict what it will do, and if I want it to do something new, I can extend Cupid.
Cupid isn't the handiest thing, but I can make it do what I want. This is my way with this site: I want it the way I want it, I enjoy writing the tools that let me make it that way, and I don't mind writing code to produce figures when other people would use a mouse.
I chatted this morning with Alex Gaynor about speeding up coverage.py on PyPy. He's already made some huge improvements, by changing PyPy so that settrace doesn't defeat the JIT.
We got to talking about speeding up how coverage stores data during execution. What coverage stores depends on how you use it. For statement coverage, it needs to store a set of numbers for each file. The numbers are the line numbers that have been executed. Currently coverage stores these in a dictionary, using the numbers as keys, and None as a value.
If you use branch coverage, then coverage needs to store pairs of numbers, the line number jumped from and the line number jumped to, for each line transition. It currently stores these in a dictionary with a tuple of two numbers as keys, and None as a value. (Yes, it should really be an actual set.)
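As a rough illustration of the two shapes of data described above (this is not coverage.py's actual code, and the line numbers are made up):

```python
# A made-up sequence of executed line numbers in one file.
trace = [1, 2, 3, 2, 3, 5]

# Statement coverage: executed line numbers as keys, None as the value.
lines = {}
for lineno in trace:
    lines[lineno] = None

# Branch coverage: (from_line, to_line) pairs as keys, None as the value.
arcs = {}
for a, b in zip(trace, trace[1:]):
    arcs[(a, b)] = None

# The same information held in actual sets.
line_set = set(lines)
arc_set = set(arcs)
```

Either way, repeated lines and repeated transitions collapse to a single entry, which is all the reporting step needs.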
For statement coverage, a faster data structure would be a bit-vector: a large chunk of contiguous memory where each bit represents a number. To store a number, set the bit corresponding to the number. This would definitely be faster than the dictionary.
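Here's a minimal sketch of such a bit-vector, with a `bytearray` standing in for the chunk of memory. The class and its methods are invented for the example. It only shows the layout and makes no claim about speed in pure Python.

```python
class BitVector:
    """A fixed-size set of small non-negative integers, one bit each."""

    def __init__(self, size):
        self.size = size
        self.bits = bytearray((size + 7) // 8)

    def add(self, n):
        if not 0 <= n < self.size:
            raise ValueError("number out of range: %r" % (n,))
        # Byte n // 8 holds the bit; n % 8 picks the bit within the byte.
        self.bits[n >> 3] |= 1 << (n & 7)

    def __contains__(self, n):
        return 0 <= n < self.size and bool(self.bits[n >> 3] & (1 << (n & 7)))

    def __iter__(self):
        return (n for n in range(self.size) if n in self)


executed = BitVector(100)
for lineno in [3, 64, 3]:
    executed.add(lineno)
```

Storing a line number is one index calculation and one bitwise OR, with no hashing and no allocation once the vector exists.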
But for branch pairs, what options do we have? A general pair-of-integers data structure is probably too complex. But I had a hunch that our data had idiosyncrasies we could take advantage of. In particular, for most of our pairs of numbers, the two numbers will be close together, since many branches are just a few statements forward in the code.
To test this hunch, I made a quick experiment. I reached for the nearest large project (Open edX), and changed its .coveragerc files to use the Python tracer. Then I hacked a counter into the Python tracer which would tally up, for each line transition, what was the delta between the source and the destination.
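The tallying itself is simple. This sketch is not the actual hack in the tracer: it takes already-collected `(source, destination)` pairs, and the sample pairs are made up.

```python
from collections import Counter


def tally_deltas(arcs):
    """Count how often each (destination - source) delta occurs."""
    return Counter(b - a for a, b in arcs)


# Made-up line transitions, not data from the real experiment.
sample_arcs = [(10, 11), (11, 12), (12, 10), (10, 11), (20, 371)]
deltas = tally_deltas(sample_arcs)
```

A loop jumping back two lines shows up as a delta of -2, and a long forward jump shows up as a large positive delta.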
After running the tests, I had data like this:
The data showed what I suspected: there's a huge bulge around zero, since most branches are to points nearby in the code. There are also spikes elsewhere, like the 80k branches that went 351 lines forward. And I don't understand why the seven largest jumps all occurred 34 or 37 times.
In any case, this pointed to a possible storage strategy. My first idea was a handful of bit-vectors, say 10. To store the pair (a, b), you use the difference b-a (which is the value our data above shows) to choose a particular bit-vector, and store a in that vector. If b-a is more than 10, then you fall back to a dictionary. Since most values of b-a are in that small range, we'll mostly use the bit-vector, and save time.
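A sketch of this first idea, under some assumptions: Python ints stand in for the bit-vectors, deltas 0 through 9 each get a vector, and everything else (including negative deltas, which the description above doesn't address) goes to the dictionary. The class name is made up.

```python
class DeltaBitVectors:
    """Store (a, b) pairs: one bit-vector per small delta, dict fallback."""

    def __init__(self, num_vectors=10):
        # Bit a of vectors[d] set means the pair (a, a + d) was seen.
        self.vectors = [0] * num_vectors
        self.fallback = {}

    def _is_fast(self, a, b):
        return a >= 0 and 0 <= b - a < len(self.vectors)

    def add(self, a, b):
        if self._is_fast(a, b):
            self.vectors[b - a] |= 1 << a
        else:
            self.fallback[(a, b)] = None

    def __contains__(self, pair):
        a, b = pair
        if self._is_fast(a, b):
            return bool((self.vectors[b - a] >> a) & 1)
        return pair in self.fallback


seen = DeltaBitVectors()
seen.add(5, 7)      # delta 2: stored in a bit-vector
seen.add(5, 400)    # delta 395: too large, goes to the dictionary
seen.add(9, 3)      # delta -6: negative, goes to the dictionary
```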
A variant of this idea is, instead of using a bit-vector, to use a vector of ints. To store (a, b), you set bit number b-a in the a'th int. If b-a is too large, then you fall back to a dictionary. To properly capture the biggest bulge around zero, you'd pick an offset like -2, and instead of b-a, you'd use b-a-offset as the bit index, so that a delta of -2 would be stored in bit 0.
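Here's a sketch of the variant. For a delta equal to the offset to land in bit 0, the bit index has to be the delta minus the offset. The class name, the default width of 8, and the use of a set for the fallback are choices made for the example.

```python
class ArcVector:
    """Store (a, b) pairs as bits in the a'th int, with a set as fallback."""

    def __init__(self, length, width=8, offset=-2):
        self.width = width      # usable bits per int
        self.offset = offset    # the delta that maps to bit 0
        self.ints = [0] * length
        self.fallback = set()

    def _bit(self, a, b):
        """Return the bit index for (a, b), or None if it doesn't fit."""
        bit = (b - a) - self.offset
        if 0 <= a < len(self.ints) and 0 <= bit < self.width:
            return bit
        return None

    def add(self, a, b):
        bit = self._bit(a, b)
        if bit is None:
            self.fallback.add((a, b))
        else:
            self.ints[a] |= 1 << bit

    def __contains__(self, pair):
        bit = self._bit(*pair)
        if bit is None:
            return pair in self.fallback
        return bool(self.ints[pair[0]] & (1 << bit))


arcs = ArcVector(length=100)
arcs.add(10, 8)       # delta -2 -> bit 0 of ints[10]
arcs.add(10, 11)      # delta 1 -> bit 3 of ints[10]
arcs.add(10, 361)     # delta too large -> fallback
arcs.add(500, 501)    # line number past the end of the vector -> fallback
```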
A quick program tried out various values of the width of the int, and the offset from zero. The results:
This says that if we use a vector of bytes, with an offset of -1, then 73% of the data will use the vector, the rest would go to the dictionary. If we use 64-bit ints with an offset of -7, then 85% of them would be fast.
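The calculation behind numbers like these can be sketched as follows. This is not the original quick program, and the tallies are invented, so the fractions it produces have nothing to do with the real results.

```python
from collections import Counter


def fraction_in_vector(delta_counts, width, offset):
    """Fraction of transitions whose delta fits in `width` bits at `offset`."""
    total = sum(delta_counts.values())
    if not total:
        return 0.0
    fast = sum(
        count
        for delta, count in delta_counts.items()
        if 0 <= delta - offset < width
    )
    return fast / total


# Made-up delta tallies (delta -> number of occurrences).
sample = Counter({-3: 4, -1: 10, 0: 20, 1: 40, 2: 16, 9: 5, 351: 5})

byte_fraction = fraction_in_vector(sample, width=8, offset=-1)
shifted_fraction = fraction_in_vector(sample, width=8, offset=-3)
```

Trying a grid of widths and offsets is then just a pair of nested loops around `fraction_in_vector`, keeping whichever combination sends the most transitions to the vector.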
One factor not accounted for here: you'd also have to limit the length of the int vector. Starting line numbers larger than the length of the vector would also have to go into the dictionary, but that should also be a small factor.
I'm not taking on any of this work now, but it's fascinating to work through the details. It might make an interesting pull request...