|Ned Batchelder : Blog | Code | Text | Site|
A few months ago, I rejiggered a bit of my home page and sidebar, and I added a link to accept Bitcoin. (You know, in case you wanted to throw some my way!)
I couldn't find a simple description of how to create a link that would let people send me Bitcoin, but I was able to create a link at Coinbase:
It seems to work, but is there a more Bitcoin-native way to do it? This links to a particular checkout at Coinbase, meaning I've set a particular amount. Coinbase also creates new addresses for every transaction, which confuses me.
Is there a link that will let me accept Bitcoin better than what Coinbase does?
I was just reminded of these two tweets, which do a great job capturing some non-technical aspects of being a developer.
Jamie Forrest said:
Elliot Loh said:
How true, how true!
When I wrote Facts and myths about Python names and values, I included figures drawn with Graphviz. I made a light wrapper to help, but the figures were mostly just Graphviz. For example, this code:
produced this figure:
The beauty of Graphviz is that you describe the topology of your graph, and it lays it out for you. The problem is, if you care about the layout, it is very difficult to control. Why are the 1, 2, 3 where they are? How do I move them? I couldn't figure it out. And I wanted to have new shapes, and control where the lines joined, etc.
So I wrote a library called Cupid to help me make the figures the way I wanted them. Cupid is an API that generates SVG figures. Half of it is generic "place this rectangle here" kinds of SVG operations, and half is very specific to the names-and-values diagrams I wanted to create.
Now to make that same figure, I use this code:
The new code is more complicated than the old code, but I can predict what it will do, and if I want it to do something new, I can extend Cupid.
Cupid isn't the handiest thing, but I can make it do what I want. This is my way with this site: I want it the way I want it, I enjoy writing the tools that let me make it that way, and I don't mind writing code to produce figures when other people would use a mouse.
I chatted this morning with Alex Gaynor about speeding up coverage.py on PyPy. He's already made some huge improvements, by changing PyPy so that settrace doesn't defeat the JIT.
We got to talking about speeding up how coverage stores data during execution. What coverage stores depends on how you use it. For statement coverage, it needs to store a set of numbers for each file. The numbers are the line numbers that have been executed. Currently coverage stores these in a dictionary, using the numbers as keys, and None as a value.
If you use branch coverage, then coverage needs to store pairs of numbers, the line number jumped from and the line number jumped to, for each line transition. It currently stores these in a dictionary with a tuple of two numbers as keys, and None as a value. (Yes, it should really be an actual set.)
For statement coverage, a faster data structure would be a bit-vector: a large chunk of contiguous memory where each bit represents a number. To store a number, set the bit corresponding to the number. This would definitely be faster than the dictionary.
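Here is a minimal sketch of such a bit-vector. The class name BitVector and its methods are made up for illustration; they are not part of coverage.py. Any speed gain would come from doing this inside the C tracer, which this pure-Python version only models.

```python
class BitVector:
    """A fixed-size set of small non-negative integers, one bit per number."""

    def __init__(self, size):
        self.size = size
        # One byte holds eight numbers, so round the size up to whole bytes.
        self.bits = bytearray((size + 7) // 8)

    def add(self, n):
        # Number n lives in byte n // 8, at bit n % 8 within that byte.
        self.bits[n >> 3] |= 1 << (n & 7)

    def __contains__(self, n):
        return 0 <= n < self.size and bool(self.bits[n >> 3] & (1 << (n & 7)))

    def numbers(self):
        """Return the stored numbers in increasing order."""
        return [n for n in range(self.size) if n in self]


# Record some executed line numbers; repeats cost nothing extra.
executed = BitVector(1000)
for lineno in [1, 2, 3, 17, 2, 3]:
    executed.add(lineno)
```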
But for branch pairs, what options do we have? A general pair-of-integers data structure is probably too complex. But I had a hunch that our data had idiosyncrasies we could take advantage of. In particular, for most of our pairs of numbers, the two numbers will be close together, since many branches are just a few statements forward in the code.
To test this hunch, I made a quick experiment. I reached for the nearest large project (Open edX), and changed its .coveragerc files to use the Python tracer. Then I hacked a counter into the Python tracer which would tally up, for each line transition, what was the delta between the source and the destination.
After running the tests, I had data like this:
The data showed what I suspected: there's a huge bulge around zero, since most branches are to points nearby in the code. There are also spikes elsewhere, like the 80k branches that went 351 lines forward. And I don't understand why the seven largest jumps all occurred 34 or 37 times?
In any case, this pointed to a possible storage strategy. My first idea was a handful of bit-vectors, say 10. To store the pair (a, b), you use the difference b-a (which is the value our data above shows) to choose a particular bit-vector, and store a in that vector. If b-a is more than 10, then you fall back to a dictionary. Since most values of b-a are in that small range, we'll mostly use the bit-vector, and save time.
A variant of this idea is to use a vector of ints instead of a bit-vector. To store (a, b), you set bit number b-a in the a'th int. If b-a is too large, then you fall back to a dictionary. To properly capture the biggest bulge around zero, you'd pick an offset like -2, and instead of b-a, you'd use b-a-offset as the bit index, so that a delta of -2 would be stored in bit 0.
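The vector-of-ints idea can be sketched like this. PairStore, its parameters, and the sample pairs are hypothetical names and values chosen for illustration, and the fallback here is a set rather than a dictionary. The bit index is the delta minus the offset, so that a delta equal to the offset lands in bit 0.

```python
class PairStore:
    """Store (a, b) pairs: nearby pairs as bits, everything else in a set."""

    def __init__(self, max_line, width=8, offset=-1):
        self.width = width      # Number of usable bits per int.
        self.offset = offset    # The delta that maps to bit 0.
        self.vector = [0] * (max_line + 1)
        self.fallback = set()

    def add(self, a, b):
        bit = (b - a) - self.offset
        if 0 <= a < len(self.vector) and 0 <= bit < self.width:
            # Fast path: the delta fits, so set one bit in the a'th int.
            self.vector[a] |= 1 << bit
        else:
            # Slow path: delta too large, or a is outside the vector.
            self.fallback.add((a, b))

    def pairs(self):
        """Reconstruct the full set of stored pairs."""
        found = set(self.fallback)
        for a, bits in enumerate(self.vector):
            for bit in range(self.width):
                if bits & (1 << bit):
                    found.add((a, a + bit + self.offset))
        return found


store = PairStore(max_line=100, width=8, offset=-1)
for pair in [(10, 11), (11, 10), (12, 18), (12, 400), (500, 501)]:
    store.add(*pair)
```

With a width of 8 and an offset of -1, deltas from -1 through 6 take the fast path; the last two pairs above end up in the fallback set.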
A quick program tried out various values of the width of the int, and the offset from zero. The results:
This says that if we use a vector of bytes, with an offset of -1, then 73% of the data will use the vector, the rest would go to the dictionary. If we use 64-bit ints with an offset of -7, then 85% of them would be fast.
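A program along those lines might look like the sketch below. It is not the original program: the function names are invented, and the transitions are made-up data for illustration only, not the Open edX measurements.

```python
from collections import Counter


def tally_deltas(pairs):
    """Count how often each delta b - a occurs among (a, b) transitions."""
    return Counter(b - a for a, b in pairs)


def fraction_in_vector(deltas, width, offset):
    """Fraction of transitions whose delta fits in bits 0..width-1."""
    total = sum(deltas.values())
    fast = sum(n for delta, n in deltas.items() if 0 <= delta - offset < width)
    return fast / total


# Made-up line transitions, for illustration only.
pairs = [
    (1, 2), (2, 3), (3, 2), (3, 4), (4, 5),
    (5, 1), (5, 6), (6, 40), (40, 41), (41, 6),
]
deltas = tally_deltas(pairs)

# Try a few int widths and offsets, as described in the text.
for width in (8, 16, 32, 64):
    for offset in (-7, -2, -1, 0):
        print(width, offset, round(fraction_in_vector(deltas, width, offset), 2))
```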
One factor not accounted for here: you'd also have to limit the length of the int vector. Starting line numbers larger than the length of the vector would also have to go into the dictionary, but that should also be a small factor.
I'm not taking on any of this work now, but it's fascinating to work through the details. It might make an interesting pull request...
Seth Robertson has written a great tutorial on how to fix mistakes in git: On undoing, fixing, or removing commits in git. I love it for two reasons.
First, it really does help you find your way through a thicket of options in the confusing world of git. I love git, but it is a power tool that has hurt many, and good instructions for the baffled are hard to come by.
More interestingly, Seth has come up with an innovative way to provide the instructions. When writing detailed technical help, things are never straightforward, literally. There are always points in the flow where you have to say, "Now, you may have done A or you may have done B, and what happens next depends on your choice." Rather than try to keep all that in a linear English flow as most of us do, Seth has provided an actual branching structure to the text.
When you try to keep it linear, you end up with something that reads like the instructions for an IRS tax form, because they have a similarly complex documentation challenge. By actually branching, Seth has made it possible to focus on the situation you are actually in.
And although the page looks as bare-bones as they come (it could have been designed in 1994), the choices you make are recorded and appear as breadcrumbs right where you are in the text.
I can see extending this technique. Imagine a set of instructions where the user has to choose the name of a directory: they enter the name into a text box, and then the rest of the instructions have the actual name filled in, so the commands they have to type don't need metasyntactic placeholders like <YOURDIRECTORY>.
Help for confused people should be more helpful, and keeping track of the twisty paths like this is a really nice way to do it. I'd love to see other examples. Explaining things is hard, and new ways to do it are fascinating.
If you really want to write high-quality polished code, you have to attend to many details. A small one that is often overlooked: comments should be complete sentences. (Throughout, "comment" includes docstrings and any other English you write about your code.) Start comments with a capital, have a subject and a verb, and end them with a period.
There are a few reasons for this. First, your program should be readable. Just as you choose variable names to have meaning so that people can read your code, you should write your comments so that people can read them. Capitals are a clear indication of the start of a sentence, and a period is a clear indication of the end:
Is that first comment complete? Was there meant to be a qualifying clause? The thing that what? Sure, the second sentence doesn't explain the thing either, but at least we can see that the author didn't just wander away in the middle of a thought.
By writing complete sentences you are more likely to include some small helper that will get your meaning across, and people will be better able to grasp what your words mean. Isn't that why you wrote them?
The second reason to write complete sentences is to focus you on what you are writing. You think about each statement in your program, why aren't you thinking about the comments? Maybe if you capitalize and punctuate your text, you'll realize that a few grunted words aren't really what you want to put there:
Some would argue that this comment is too many words, that you shouldn't explain in that much detail, that the code should be more self-explanatory. That's great: now you're thinking about what the comments are doing, and making them better. I'm not suggesting writing longer comments just for bulk. If you can say it in fewer words, do, but make them good words in a complete sentence.
Paying more attention to the comments will help you write better code. I can't tell you how many times I've written what I thought was a perfectly good function or line of code, then gone to write the comment or docstring, and realized a better way to do it, or even just a better name. Explaining yourself to others is a really good way to understand what you are doing. Understanding what you are doing is a really good way to write good code.
Another reason to make your comments complete sentences: it makes it easier to extend them later:
The person who adds the second sentence had to add the period to the first, and now it looks really strange that the second sentence is unterminated. If the first sentence had been a real sentence, the second sentence could be added naturally.
Many coders will look at this advice and complain that it is way too nit-picking, that punctuation in comments is irrelevant, that since it's natural language, it's readable as it is, we don't have to worry about trivialities like punctuation they are wrong text needs punctuation to be readable leaving it out just makes it hard to parse the sentences see what i did there?
One last reason for full sentences: the programming variant of the broken windows theory says that if you take care of small things, others are more likely to take care of the bigger things. Polished code is more likely to be maintained well and will set the tone for more polished code in the future.
And isn't that what we all want? Write complete sentences.
While writing a test suite, I wrote a helper function to renumber the ids in SVG figures. It had enough interesting bits that I'll share it here, and maybe I'll get suggestions on better ways to do it.
BTW: The test is for an SVG-drawing library that I hacked together to replace the graphviz diagrams I made for Facts and myths about Python names and values. Graphviz is frustrating if you know what you want things to look like, so when I turned that article into a presentation, I re-did the diagrams in SVG. Now I want to make that library more formal, so I need tests!
The first tests are just the figures from the presentation, packaged as unit tests. But the figures have ids in them which are auto-assigned, and if the tests run in a different order than the original figures, the ids will be different. So I wrote a helper function that finds the ids and renumbers them, to canonicalize the SVG.
I chose to use regexes, since formally parsing the SVG to find the ids would involve not just XML parsing but CSS parsing, and this domain is specialized enough and tightly-controlled enough that I'm confident that a regex will do a good job.
First the code, then we'll go over it line by line:
At heart, this function is conceptually simple: take a string, and return the string with the ids replaced. Since the same id can appear multiple times in the string, we need to be careful to replace the same id with the same replacement. To keep track of what was replaced with what, at line 10, id_map is a dictionary mapping old ids to new ids.
At line 11, we have a generator expression that will make new ids for us. itertools.count is an infinite sequence of integers; we format those into the form "newid123", and the generator expression gives us an infinite stream of those ids.
The heart of renumber_svg_ids is a function for use with re.sub. The simple and common way to use re.sub is to give it a regex pattern and a string replacement. But instead of a string replacement, you can use a function. Every match is passed to the function as a match object, and the string returned by the function is used as the replacement.
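As a small stand-alone illustration of passing a function to re.sub (the shout function and the sample text are made up, and are not part of renumber_svg_ids):

```python
import re


def shout(match):
    """Replacement function: re.sub calls this once per match."""
    # match.group(0) is the entire text matched by the pattern.
    return match.group(0).upper()


text = "id1 and id22 but not x3"
result = re.sub(r"id\d+", shout, text)
```

Each match is handed to shout as a match object, and whatever string it returns is spliced into the result in place of the matched text.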
Our function new_repl on line 13 takes a match object and a format string for the replacement. Line 20 gets the actual id out of the match object: match.group(1) returns the string matched by the first parenthesized group in the regex pattern, so found_id will be something like "id123".
On line 21, if the id isn't in our map, then we haven't seen this id yet, so we make a new id by pulling the next value from our generator expression. Generators are usually consumed in a for-loop of some kind, but you can use the next() builtin to just grab the next value from one.
Finally we use the new_id_fmt format string, giving it the new id, and return the result on line 23.
It's unusual to see nested functions in Python, but they work fine. One issue is variable scope: notice that we use id_map inside the new_repl function, but id_map is defined in the outer function. This works so long as we don't reassign the id_map name. Reassigning it from inside new_repl won't work right in Python 2; you'd need the Python 3 nonlocal keyword. Luckily we don't need to reassign the name, we just use methods on it. Those methods modify the value, but that's still not an assignment to the name, so we are OK.
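The difference between modifying a value and reassigning a name can be seen in a small Python 3 sketch. The names make_counter, record, seen, and total are invented for this example.

```python
def make_counter():
    seen = {}    # Modified below but never rebound: no declaration needed.
    total = 0    # Rebound below: needs nonlocal, which is Python 3 only.

    def record(key):
        nonlocal total
        # This changes the dictionary's contents; the name still refers
        # to the same dictionary, so the outer variable is used directly.
        seen[key] = seen.get(key, 0) + 1
        # This is an assignment to the name itself. Without the nonlocal
        # declaration it would raise UnboundLocalError.
        total += 1
        return total

    return record, seen


record, seen = make_counter()
record("a")
record("b")
last = record("a")
```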
Now that we have our re.sub replacement function, we're going to use it twice, once to replace "id='id123'" instances, and once for "#id123". You may have noticed an odd thing about our new_repl function: it takes two arguments. But the function re.sub will call only takes one: the match object. We need two arguments so that the replacement format could be different for the two times we're going to use it.
To turn our two-argument function into two different one-argument functions, we use functools.partial. You give it a function, and some arguments, and it returns a new function that will call your function with those arguments pre-supplied. In our case, line 28 uses functools.partial to make a new function that is our new_repl with the given string as new_id_fmt. The result is a function of only the one remaining argument, the match object, which is just what re.sub wants.
Lines 26 and 32 are our two calls to re.sub. They each make replacements in the svg string, and the final result is returned at the end.
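Putting the pieces of the walkthrough together, a function along these lines could look like the sketch below. It is a reconstruction from the description, not the original listing, so the line numbers mentioned in the text don't apply to it. The two regex patterns and the use of str.format for new_id_fmt are assumptions.

```python
import functools
import itertools
import re


def renumber_svg_ids(svg):
    """Return `svg` with its ids renumbered in order of first appearance."""
    # Maps each old id to the new id that replaces it.
    id_map = {}
    # An infinite stream of new ids: "newid0", "newid1", ...
    new_ids = ("newid{}".format(n) for n in itertools.count())

    def new_repl(match, new_id_fmt):
        """Replacement function for re.sub, once new_id_fmt is supplied."""
        found_id = match.group(1)
        if found_id not in id_map:
            id_map[found_id] = next(new_ids)
        return new_id_fmt.format(id_map[found_id])

    # Definitions, such as id="id123" or id='id123' (assumed pattern).
    svg = re.sub(
        r"""id=["'](id\d+)["']""",
        functools.partial(new_repl, new_id_fmt='id="{}"'),
        svg,
    )
    # References, such as href="#id123" or url(#id123) (assumed pattern).
    svg = re.sub(
        r"#(id\d+)",
        functools.partial(new_repl, new_id_fmt="#{}"),
        svg,
    )
    return svg


canonical = renumber_svg_ids(
    '<g id="id7"/><use href="#id7"/><g id="id3"/><use href="#id3"/>'
)
```

Two figures that differ only in their auto-assigned ids should come out identical after this renumbering, which is what makes the comparison in the tests stable.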
A few minor things to note: on line 14, the docstring for new_repl is a raw string, because I have a backslash in it that I want to remain literal, although the "\1" is an obscure way to refer to the first group, and in any case the docstring of an inner function is unlikely to ever appear anywhere else, so who's reading it? On line 27, I used a triple-quoted string even for a single-line string, because it let me avoid escaping the two kinds of quotes I have in the regex.
Of course, there's still room for new ways to do things. Line 21, the check if the value is already in the dictionary, raises an eyebrow: Python has better ways to do that sort of thing. The defaultdict class can automatically create values for missing keys.
So we can re-write the top of our function like this (with docstrings removed for brevity):
The new_ids generator is exactly the same. But now we use it in a defaultdict. When a key is missing, defaultdict will invoke the lambda function, which will use next() to get the next id. Now the body of new_repl has no conditional in it at all, it simply looks up the found id in the map. If it's not there already, the defaultdict will make a new one, and if it is there, it will simply return the saved value. For bonus points, you could replace our new lambda function with another call to functools.partial.
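A compact sketch of the defaultdict approach follows. To stay short it uses a single simplified pattern and a one-argument replacement function, so it is not the full renumber_svg_ids; the name renumber_ids and the pattern are assumptions.

```python
import collections
import itertools
import re


def renumber_ids(text):
    """Replace each distinct idNNN in `text` with newid0, newid1, ..."""
    new_ids = ("newid{}".format(n) for n in itertools.count())
    # A missing key triggers the lambda, which pulls the next new id.
    # functools.partial(next, new_ids) would work in place of the lambda.
    id_map = collections.defaultdict(lambda: next(new_ids))

    def new_repl(match):
        # No conditional needed: the lookup creates the entry if necessary.
        return id_map[match.group(1)]

    return re.sub(r"\b(id\d+)\b", new_repl, text)


renumbered = renumber_ids("id42 id7 id42 id9")
```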
In the back of my mind, I'm wondering if there isn't a better way to accomplish this entirely. Maybe find all the ids in one pass, and then replace them all in another?
Welcome to 2014! Who knows what it will bring? In 2013 I joined edX full-time (including open-sourcing the whole thing) and had a little surgery. On the side, I was greatly drawn to Python community work, notably Boston Python. I didn't write as much here (or elsewhere!) as I thought I would, maybe that's OK, though that's definitely something I would like to improve on.
At home, I watched and helped as my family continued to develop their own paths in various ways: Nat in a group home; Max at NYU; Ben in high school and the arts; and Susan as a writer and advocate.
For the record, here's us:
And here's us, more realistically:
I couldn't have predicted 2013, so I won't try to predict 2014. As always, here's to living mindfully, and having it come out the way you want!
Python is well-known for its duck-typing: objects are examined for what they can do rather than for what type they are. But if you like being strict about the methods derived classes have to implement, you can use the abstract base classes in the abc module.
They let you define a class, with some methods defined as abstract, and if those methods aren't defined in a subclass, the subclass can't be instantiated:
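As an illustration in Python 3 (the Shape, Square, and HalfDone classes are invented for this example):

```python
import abc


class Shape(metaclass=abc.ABCMeta):
    """An abstract base class with two abstract methods."""

    @abc.abstractmethod
    def area(self):
        """Return the area of the shape."""

    @abc.abstractmethod
    def perimeter(self):
        """Return the perimeter of the shape."""


class Square(Shape):
    """Defines every abstract method, so it can be instantiated."""

    def __init__(self, side):
        self.side = side

    def area(self):
        return self.side ** 2

    def perimeter(self):
        return 4 * self.side


class HalfDone(Shape):
    """Leaves perimeter undefined, so it cannot be instantiated."""

    def area(self):
        return 0


try:
    HalfDone()
except TypeError as exc:
    # The message names the abstract method that is still missing.
    error = str(exc)
```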
This is great when you want to be strict, and can remind you of your pleasant days writing Java! But like Java, you can find yourself in situations where you have an abstract base class with a handful of abstract methods, and know that you only need a few of them. The usual remedy at this point is to define all the missing methods knowing they'll never be called. This is the worst of "keeping the compiler happy": you know what you need, but the type checking insists that you go through the motions.
Here's another option: a class decorator that erases the list of abstract methods, so that the class can be instantiated:
Now we can make a subclass of our abstract base class, not define any methods, and still instantiate the class:
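A minimal sketch of such a decorator, assuming it works by clearing the class's `__abstractmethods__` attribute. The Shape and Blob classes are invented for the example.

```python
import abc


def unabc(cls):
    """Class decorator: forget the abstract methods so cls can be instantiated."""
    # An empty set of abstract methods means the class is no longer abstract.
    cls.__abstractmethods__ = frozenset()
    return cls


class Shape(metaclass=abc.ABCMeta):
    @abc.abstractmethod
    def area(self):
        """Return the area of the shape."""

    @abc.abstractmethod
    def perimeter(self):
        """Return the perimeter of the shape."""


@unabc
class Blob(Shape):
    """Defines none of the abstract methods."""


# Without the decorator, this line would raise TypeError.
blob = Blob()
```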
If we want to get fancier, we can! The missing abstract methods aren't going to be called (we think!) but we can provide stub methods just in case. The stub methods will raise an error with a message naming the method. For extra bells and whistles, the message will be settable in the decorator, and the decorator will be usable with or without a customized message:
Here the _unabc function is the actual decorator. It loops over all the abstract method names, and makes a new stub method for each one. The make_stub_method function is needed because we need to close over the ab_name variable so it will have the proper value when called.
Then stub_method is defined as the actual method that will be added to the class with setattr. Yes, this is four defs nested inside each other: one to define the decorator you use, one to be the actual decorator applied to the class, one to form a closure so we can define stub methods, and one to create the stub methods themselves!
The last part here is to deal with the two ways the unabc decorator can be used: if it's used without an argument, then the class in question will be the argument, and the isinstance check will be true. In that case, we'll use the argument as the class, and provide a default message. If the argument isn't a class, then we return _unabc, and the argument is already provided as a default msg for the _unabc function.
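The structure described above can be sketched as follows. This is a reconstruction from the description rather than the original listing: raising NotImplementedError, the default message text, and the use of str.format are assumptions, and the Shape, Blob, and Smudge classes are invented for the example.

```python
import abc


def unabc(arg):
    """Class decorator usable as @unabc or as @unabc("message with {}")."""

    def _unabc(cls, msg=arg):
        def make_stub_method(ab_name):
            # Close over ab_name so each stub reports its own method name.
            def stub_method(self, *args, **kwargs):
                raise NotImplementedError(msg.format(ab_name))
            return stub_method

        for ab_name in cls.__abstractmethods__:
            setattr(cls, ab_name, make_stub_method(ab_name))
        cls.__abstractmethods__ = frozenset()
        return cls

    if isinstance(arg, type):
        # Used as plain @unabc: arg is the class, so supply a default message.
        return _unabc(arg, "Method {} is not implemented")
    # Used as @unabc("..."): arg is the message, already bound as msg.
    return _unabc


class Shape(metaclass=abc.ABCMeta):
    @abc.abstractmethod
    def area(self):
        """Return the area of the shape."""


@unabc
class Blob(Shape):
    pass


@unabc("{} is deliberately left out")
class Smudge(Shape):
    pass
```

Both classes can be instantiated, and calling area() on either raises an error whose message names the method.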
BTW: all the code above is Python 3. The only thing to change for Python 2 is how the ABCMeta metaclass is associated with your abstract class:
I was experimenting with pydoc yesterday, and was baffled by how it was running. Turns out Mac OS X does some tricky stuff to support multiple versions of Python.
If you type "pydoc" at a shell prompt, it works properly:
If you ask which file is being run, it's /usr/bin/pydoc, and you can look at that file:
Notice that this file will always write "python version X.Y.Z can't run...," which is not the output we're getting. Weird!
(BTW, if you have activated a virtualenv, you may have an alias, so that "pydoc" is actually "python -m pydoc". Is there an equivalent to "which" that will include that fact in its output?)
But even without the virtualenv alias, this file isn't being run. Why not? The answer is in the shebang line:
Unix uses the shebang line to find a program to run the file. So typing "pydoc" at the prompt will find /usr/bin/pydoc, then find the shebang line, and will actually run this:
Seems simple enough: invoke Python, and have it run the code in /usr/bin/pydoc. So why isn't Python running the Python code we saw? The answer is that /usr/bin/python is not a Python interpreter!
On OS X, /usr/bin/python examines various settings and then invokes a real Python interpreter of the correct version, as described in the python.1 man page. A quick look at the readable text inside the executable confirms that it is not a full interpreter, and that it is concerned with versions:
So this finder is given "/usr/bin/pydoc" as an argument, and it decides what to really run. It's special-casing "/usr/bin", and actually invoking /usr/bin/pydoc2.7. The real /usr/bin/pydoc file is there only to be executed when the version-selection mechanism fails, which is why it simply prints messages about not being able to find the right version.
It seems that the switcher doesn't care much what command you're trying to run. If it's in /usr/bin, and there's a file alongside it with the same name but the current Python version appended, then it will run the versioned one.
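The rule as I've guessed it can be modeled in a few lines of Python. This is only a model of the behavior described above, not Apple's actual logic; the function choose_script and the fake file listing are hypothetical.

```python
import os.path


def choose_script(path, version, exists=os.path.exists):
    """Guess which file the switcher runs for `path` (a model, not the real thing)."""
    # Only paths given literally as /usr/bin/<name> are special-cased.
    if os.path.dirname(path) == "/usr/bin":
        versioned = path + version
        if exists(versioned):
            return versioned
    # Anything else is run exactly as specified.
    return path


# A pretend file system, so the sketch doesn't depend on the real /usr/bin.
fake_files = {"/usr/bin/foo", "/usr/bin/foo2.7"}
chosen = choose_script("/usr/bin/foo", "2.7", exists=fake_files.__contains__)
```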
All this can be verified by an experiment. I created /usr/bin/foo with these contents:
and a /usr/bin/foo2.7 with this:
Then I ran it a number of different ways (the current directory is in the prompt):
Notice that the Python switcher runs the versioned file whenever it's identified as being from /usr/bin. But when run so that the shell doesn't identify it that way, even when it's the exact same file, the Python switcher decides it shouldn't interfere, and it runs the exact file specified.
I've just released the latest version of coverage.py: coverage.py v3.7.1. The only changes are performance improvements in the HTML report, and a little restructuring to make Debian packaging a little easier.
The next version will be 4.0, and I'm looking for feedback about major changes you'd like to see. Coverage.py has never had an API-centric mindset, for example. Breaking backward compatibility will be OK if there's a good reason, and I'm dropping support for older Pythons, so it should be easier to make changes.
Send me your ideas.
I help people with Python questions, and often those people are only casual acquaintances (at Boston Python) or complete strangers (in the #python IRC channel). There's a common misunderstanding that can crop up in these situations. It has to do with asking why.
While helping, I often ask the asker, "why?" For example, someone will need help with the two Python installations on their laptop. I'll ask, "Why did you install a second Python?" When I did this the other night, another guy chuckled, because he thought I meant, "You shouldn't have installed a second Python."
This is a common reaction. Asking why is perceived as a rebuke. "Why did you XYZ?" is taken to mean, "You dummy, you shouldn't have XYZed!" But when I ask it, I really do mean, "I want to understand what led you to XYZ."
English can be difficult that way, especially over a purely textual, cue-less medium like IRC. One way to soften the apparent bluntness is to add more words. For example, instead of:
Why did you install another Python?
this might go over better:
Do you mind if I ask you why you installed another Python? Understanding the reason might provide an important clue.
So if I ask you why, don't take it personally. I really do want to know the reason for something. If I want to chide you for making a mistake, I'll say something like, "You shouldn't have XYZed," or, "I don't think it was a good idea to XYZ."
Sometimes even when people understand I'm looking for reasons, they bristle at the question. They insist the reasons aren't important, and why can't I just answer a simple question?
I ask because 75% of the time, my answer changes once I know the bigger picture. It's common for people to start with a problem, and work towards a solution, and get stuck. Then they ask about the thing they are stuck on. That's natural. But it's also common that the reason they are stuck is there was a better path to start down in the first place. Understanding the reasons can help find that better path.
Asking why can help find those earlier choices, and lay out the entire problem. Having all the information means we can work together to find the best solution.
If I ask you to explain the larger problem, don't take it personally. Solving complex problems is hard, and it's easy to choose a first step that makes the second step more difficult than it needs to be. When a helper asks about the bigger picture, they aren't trying to make you look silly, they're trying to help you find the best solution.
This phenomenon has a name: the XY problem. You have problem X, you choose solution Y, and when it doesn't work, you ask about Y instead of asking about X. Despite some people's disdain for this dynamic, it's very common and people shouldn't be put down for it. Sometimes the asker can't see that there were alternatives earlier on. Sometimes the asker is trying to not take up too much of the helper's time by asking a focused question. Whatever the reason, it happens all the time, and helpers should get out of the helping business if XY problems make them mad.
Askers: if someone asks you why, or says you have an "XY problem," that means they are trying to help you find the best solution. Don't feel bad. We're all struggling together to learn and overcome complexity.