Candy in my pocket

Saturday 18 November 2017

Let me tell you a modern-day horror story: for almost ten days, I didn't have a phone!

I was on a day trip to Montréal, and my phone just completely died. I thought maybe it just needed a charge, but nope, nothing would bring it back. I had a nicely sculpted chunk of glass.

(Side note: I had been texting with Susan, so eventually I dashed off a quick email: "My phone is completely dead. I can't even tell what time it is." She sent back an email that said just, "It's 11:24." Is it any wonder I love her?)

At first, I felt a bit lost. I couldn't take pictures, I couldn't use maps, I couldn't text with Susan to coordinate getting picked up at the airport.

But I noticed that I didn't really miss much of what I was used to doing with my phone. I didn't have games to jump to when I had a free moment. I wasn't able to reflexively look up interesting tidbits. I couldn't anxiously check if I had gotten an interesting email.

I realized I didn't need those things. It was OK to not have a smaller screen to use when I wasn't using my larger screen. I started to feel like the phone was like a bag of candy in my pocket. If my pocket is full of candy, I'll nibble on candy all day long. But when I didn't have a bag of candy, I didn't miss it. Sure, sometimes I'd like to have a piece of candy, but not nearly as much as I ate when I always had a bag of candy with me.

Now I finally (USPS can be disappointing...) have a new phone, a Google Pixel. I'll be glad to have my podcasts back. I can take pictures again. I was lucky not to need a two-factor app in the last week.

I'll have to see how I relate to it. I'll have the candy with me, but will I be able to resist nibbling? I wasn't as bad as some: I never had the impulse to read my phone while standing at a urinal, for example. But I nibbled a lot more than it turns out I needed to. I'll try to keep in mind how these last ten days felt, and snack responsibly.

Finding your first OSS project

Saturday 11 November 2017

Running in the circles I do, I often hear the question, "Where's a good open source project to start off contributing to?" The last time this came up, I asked on Twitter and got some good replies.

The best answers pointed to two aggregators of projects. These sites collect links to projects that have special labels for bug reports that are good for first-time contributors to work on. The presence of these labels is a good indicator that the project is well-maintained, welcoming to newcomers, and prepared for their contributions.

  • Up For Grabs lists dozens of projects, helpfully showing how many open first-timer issues each has.
  • Awesome for Beginners is lower-tech, but also lists projects with links to their first-timer tagged issues.

I also got links to some useful advice for first-time contributors:

Making a first contribution can be overwhelming. Keep looking through these resources until you find something that makes it feel do-able.

Toxic experts

Thursday 2 November 2017

I wrote Big-O: how code slows as data grows to explain Big-O notation. My goal was to explain it in broad strokes for people new to the idea. It grew out of thinking I'd been doing about beginners and experts. In the comments, there was an unexpected and unfortunate real-world lesson.

An anonymous coward named "pyon" said I should be ashamed. They pointed out a detail of algorithmic analysis that I had not mentioned. It's a detail that I had never encountered before. I think it's an interesting detail, but not one that needed to be included.

Pyon is an example of a toxic expert. People like this know a lot, but they use that knowledge to separate themselves from the unwashed masses of newbs. Rather than teach, they choose to sneer from their lofty perches, lamenting the state of the world around them, filled as it is with People Who Don't Want To Learn.

The important skill pyon and other toxic experts are missing is how to connect with people. They could use their knowledge to teach, but it's more important to them to separate themselves from others. Points of correctness are useless without points of connection.

Toxic experts care more about making distinctions between people to elevate themselves than they do about helping people. Beware: they are everywhere you look in the tech world. It's easy to run into them when you are trying to learn. Ignore them. They don't know you, and they don't know what you can do.

Pyon is fixated on a particular detail of algorithmic analysis, and feels that it is essential to understanding Big-O. What I can tell you is that I am doing fine in my 30-year career, and I had never heard of that particular detail. My Big-O piece wasn't meant to be exhaustive. There are entire books written about algorithmic notation. I even wrote at the end, "There's much more to algorithm analysis if you want to get into the true computer science aspects of it, but this is enough for working developers."

But pyon can't see the forest for the trees. Experts have spent a lot of time and energy learning what they know. They love their knowledge. They wouldn't have been able to get where they are without a passion for the subject. But sometimes they have a hard time seeing how people can be successful without that well-loved knowledge. They've lost sight of what it means to be a beginner, and what beginners need to learn.

Toxic experts will latch onto a particular skill and decide that it is essential. For them, that skill is a marker dividing Those-Who-Know from Those-Who-Don't. These shibboleths vary from expert to expert. In the current case, it's a detail of algorithmic analysis. I've seen other toxic experts insist that it's essential to know C, or assembly language, or recursion and pointers, and so on.

I'm not saying those aren't good things to know. The more you know, the better. Every one of these topics will be useful. But they are not essential. You can do good work without them. You certainly don't deserve to be spat upon.

The ultimate irony is that while pyon and other toxic experts are bemoaning the state of the industry because of missing knowledge, they are highlighting the main skill gap the tech industry needs to fix: empathy.

How code slows as data grows

Wednesday 18 October 2017

One of the parts of the vacation talk I did in September at Boston Python was about big-O notation. I've noticed that topic seems to be some kind of dividing line for people who feel insecure about not having a computer science degree. I wanted to explain it in simple practical terms so that people could understand it well enough to inform their choices during everyday coding.

I liked how it came out, so I wrote it up as a standalone piece: Big-O: how code slows as data grows.

Beginners and experts

Saturday 23 September 2017

I gave a talk at Boston Python the other night. It started as an exposition of the point matching algorithm I've previously written about on this blog. But as I thought more about my side project, I became more interested in talking about the challenges I faced while building it. Not because the specifics were so interesting, but because they were typical problems that all software projects face.

And in particular, I wanted to underscore this point: software is hard, even for experts. Experts have the same struggles that beginners do.

I used this tweet as an illustration:

The two states of being a developer

I love the raw emotion on the two boys' faces. They perfectly illustrate both the frustration and exhilaration of writing software.

But here's what beginners might not understand: they assume beginners feel the frustration and experts feel the exhilaration. As any expert will tell you, experts feel plenty of frustration. They feel like that left-hand kid a lot.

The difference between beginners and experts is that experts are familiar with that frustration. They encounter it all the time, as they deal with new unfamiliar technologies, or a thorny bug, or just when they are too tired to tackle the problem before them. They know the frustration is because they are facing a challenging problem. They know it isn't a reflection of their self-worth or abilities. Experts know the feeling will pass, and they have techniques for dealing with the frustration.

When beginners get frustrated, they can start to worry that they are not cut out for software, or they are dumb, or that everyone else gets it and they don't.

The good news for beginners is: this isn't about you. Software is difficult. We build layer upon layer of leaky abstractions, of higher and higher realms of virtualization and complexity. There's no natural limit to how high our towers of complexity can go, so we are forced to try to understand them, or tame them, and it's very hard. Our languages and tools are obscure and different and new ones are invented all the time. Frustration is inevitable.

The bad news for beginners is: this feeling won't stop. If you do your software career right, then you will always be a newb at something. Sure, you can master a specialty, and stick with it, but that can get boring. And that specialty might dwindle away leaving you stranded.

You will be learning new things forever. That feeling of frustration is you learning a new thing. Get used to it.

New backups: Arq to Wasabi

Sunday 27 August 2017

This week CrashPlan announced they were ending consumer services, so I had to replace it with something else. Backups are one of those things at the unpleasant intersection of tedious, difficult, and important.

A quick spin around the latest alternatives showed the usual spectrum of possibilities, ranging from Perl hackers implementing rsync themselves to slick consumer tools. I need to have something working well not just on my computer, but on my family's computers too, so I went the consumerish route.

Arq backing up to Wasabi seems like a good choice for polish and price.

One thing I always struggle with: how to ensure my stuff is backed up, without needlessly copying around all the crap that ends up in my home directory that I don't need backed up. On a Mac, the ~/Library directory has all sorts of stuff that I think I don't need to copy around. Do I need these?:

  • Library/Application Support
  • Library/Caches
  • Library/Containers

I add these directories to the exclusions. Should my Dropbox folder get backed up? Isn't that what Dropbox is already doing?

Then as a developer, there's tons more to exclude. Running VirtualBox? You might have a 10Gb disk image somewhere under your home. I have something like 20,000 .pyc files. The .tox directory for coverage.py is 350Mb.

So I also exclude these:

  • .git
  • .hg
  • .svn
  • .tox
  • node_modules
  • .local
  • .npm
  • .vagrant.d
  • .vmdk
  • .bundle
  • .cache
  • .heroku
  • .rbenv
  • .gem
  • *.pyc
  • *.pyo
  • *$py.class

Of course, as a native Mac app for consumers, Arq doesn't provide a way for me to supply all of these at once. I have to fiddle with GUI + and - buttons, and enter them one at a time...
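Out of curiosity, here's a rough sketch of how you might total up the space those patterns take under your home directory. This is just my own quick hack, not anything Arq provides, and I'm guessing that .vmdk is meant as a file extension:

import fnmatch
import os

# Directory names and file globs from the exclusion list above.
EXCLUDED_DIRS = {
    ".git", ".hg", ".svn", ".tox", "node_modules", ".local", ".npm",
    ".vagrant.d", ".bundle", ".cache", ".heroku", ".rbenv", ".gem",
}
EXCLUDED_GLOBS = ["*.pyc", "*.pyo", "*$py.class", "*.vmdk"]

def tree_size(root):
    """Total bytes of all files under root."""
    total = 0
    for dirpath, _, filenames in os.walk(root):
        for fn in filenames:
            try:
                total += os.path.getsize(os.path.join(dirpath, fn))
            except OSError:
                pass
    return total

def excluded_size(root):
    """Total bytes the exclusion list would skip under root."""
    total = 0
    for dirpath, dirnames, filenames in os.walk(root):
        for name in list(dirnames):
            if name in EXCLUDED_DIRS:
                total += tree_size(os.path.join(dirpath, name))
                dirnames.remove(name)   # don't walk into it again
        for fn in filenames:
            if any(fnmatch.fnmatch(fn, g) for g in EXCLUDED_GLOBS):
                try:
                    total += os.path.getsize(os.path.join(dirpath, fn))
                except OSError:
                    pass
    return total

print("%.1f GB would be excluded" % (excluded_size(os.path.expanduser("~")) / 1e9))

At least that would tell me whether the exclusions are worth the GUI fiddling.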

Lastly, some files don't seem comfortable with backups. Thunderbird's storage files are large, and while Arq copies only certain byte ranges, they still amount to about 300Mb each time. Should I even back up my email? Should I still be using Thunderbird? Too many uncertainties....

Coverage.py podcast

Sunday 6 August 2017

I was a guest on the Podcast.__init__ podcast this week: Coverage.py with Ned Batchelder. We talk about coverage.py, how I got started on it, why it's good, why it's not good, how it works, and so on.

I've only listened to a small part of it. Getting ideas out verbally is different than in writing: there's no chance to go back and edit. To me, I sound a bit halting, because I was trying to get the words and sentences right in my head before I said them. I hope it sounds OK to others. Also, I think I said "dudes" and "guy" when I could have chosen more neutral words, sorry about that.

Tobias has been doing a great job with this podcast. It's not easy to consistently put out good content (I hate that word) on a regular schedule. He's been finding interesting people and giving them a good place to talk about their Python world, whatever that means to them.

BTW, this is my second time on Podcast.__init__, and it seems I never mentioned the first time in this blog, so here it is, from two years ago: Episode 5: Ned Batchelder. The focus then was the user group I organize, Boston Python, so a completely different topic.

And in the unlikely case that you want yet more of my dulcet tones, I was also on the Python Test podcast, mentioned in this blog post: The Value of Unit Tests.

Look around you

Sunday 23 July 2017

I've been trying my Instagram experiment for a year now. I've really liked doing it: it gives me a prompt for looking around me and seeing what there is to see.

One of the things that surprised me when I looked back is how many pictures I took in a very small area, one that I would have thought of as uninteresting: the few blocks between where I swim in the morning, and where I work. Of the 197 pictures I took in the last year, 38 of them are from that neighborhood:

[38 photos, each captioned "Something I shot"]

I'm not saying these are all masterpieces. But I wouldn't have thought to take them at all if I hadn't been explicitly looking for something interesting to shoot.

Look around you: not much is truly uninteresting.

Finding fuzzy floats

Sunday 9 July 2017

For a 2D geometry project I needed to find things based on 2D points. Conceptually, I wanted to have a dict that used pairs of floats as the keys. This would work, except that floats are inexact, and so have difficulty with equality checking. The "same" point might be computed in two different ways, giving slightly different values, but I want them to match each other in this dictionary.

I found a solution, though I'm surprised I didn't find it described elsewhere.

The challenges have nothing to do with the two-dimensional aspect, and everything to do with using floats, so for the purposes of explaining, I'll simplify the problem to one-dimensional points. I'll have a class with a single float in it to represent my points.

First let's look at what happens with a simple dict:

>>> from collections import namedtuple
>>> Pt = namedtuple("Pt", "x")
>>>
>>> d = {}
>>> d[Pt(1.0)] = "hello"
>>> d[Pt(1.0)]
'hello'
>>> d[Pt(1.000000000001)]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: Pt(x=1.000000000001)

As long as our floats are precisely equal, the dict works fine. But as soon as we get two floats that are meant to represent the same value, but are actually slightly unequal, then the dict is useless to us. So a simple dict is no good.

To get the program running at all, I used a dead-simple structure that I knew would work: a list of key/value pairs. By defining __eq__ on my Pt class to compare with some fuzz, I could find matches that were slightly unequal:

>>> import math
>>> class Pt(namedtuple("Pt", "x")):
...     def __eq__(self, other):
...         return math.isclose(self.x, other.x)
...
>>> def get_match(pairs, pt):
...     for pt2, val in pairs:
...         if pt2 == pt:
...             return val
...     return None
...
>>> pairs = [
...     (Pt(1.0), "hello"),
...     (Pt(2.0), "goodbye"),
... ]
>>>
>>> get_match(pairs, Pt(2.0))
'goodbye'
>>> get_match(pairs, Pt(2.000000000001))
'goodbye'

This works, because now we are using an inexact closeness test to find the match. But we have an O(n) algorithm, which isn't great. Also, there's no way to define __hash__ to match that __eq__ test (closeness isn't transitive, so no hash function can agree with it), so our points are no longer hashable.

Trying to make things near each other be equal naturally brings rounding to mind. Maybe that could work? Let's define __eq__ and __hash__ based on rounding the value:

>>> class Pt(namedtuple("Pt", "x")):
...     def __eq__(self, other):
...         return round(self.x, 6) == round(other.x, 6)
...     def __hash__(self):
...         return hash(round(self.x, 6))
...
>>> d = {}
>>> d[Pt(1.0)] = "apple"
>>> d[Pt(1.0)]
'apple'
>>> d[Pt(1.00000001)]
'apple'

Nice! We have matches based on values being close enough, and we can use a dict to get O(1) behavior. But:

>>> d[Pt(1.000000499999)] = "cat"
>>> d[Pt(1.000000500001)]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: Pt(x=1.000000500001)

Because rounding has sharp edges (ironically!), there will always be pairs of numbers arbitrarily close together that will round away from each other.

So rounding was close, but not good enough. We can't guarantee that we won't have these pairs of numbers that are in just the wrong place so that rounding will give the wrong answer.

I thought about other data structures: bisection in a sorted list sounded good (O(log n)), but that requires testing for less-than, and I wasn't sure how the fuzziness would affect it. Reading about other geometric algorithms led to things like k-d trees, which seem exotic and powerful, but I didn't see how to put them to use here.

Then early one morning when I wasn't even trying to solve the problem, I had an idea: round the values two different ways. The problem with rounding was that it failed for values that straddled the rounding boundary. We can round twice, the second time with half the rounding window added, so the two are half out of phase. That way, if the first rounding separates the values, the second rounding won't.

We can't use double-rounding to produce a canonical value for comparisons and hashes, so instead we make a new data structure. We'll use a dict to map points to values, just as we did originally. A second dict will map rounded points to the original points. When we look up a point, if it isn't in the first dict, then round it two ways, and see if either is in the second dict. If it is, then we know what original point to use in the first dict.

That's a lot of words. Here's some code:

from collections import namedtuple

class Pt(namedtuple("Pt", "x")):
    # no need for special __eq__ or __hash__
    pass

class PointMap:
    def __init__(self):
        self._items = {}
        self._rounded = {}

    ROUND_DIGITS = 6
    JITTERS = [0, 0.5 * 10 ** -ROUND_DIGITS]

    def _round(self, pt, jitter):
        return Pt(round(pt.x + jitter, ndigits=self.ROUND_DIGITS))

    def __getitem__(self, pt):
        val = self._items.get(pt)
        if val is not None:
            return val

        for jitter in self.JITTERS:
            pt_rnd = self._round(pt, jitter)
            pt0 = self._rounded.get(pt_rnd)
            if pt0 is not None:
                return self._items[pt0]

        raise KeyError(pt)

    def __setitem__(self, pt, val):
        self._items[pt] = val
        for jitter in self.JITTERS:
            pt_rnd = self._round(pt, jitter)
            self._rounded[pt_rnd] = pt

This is now O(1), wonderful! I was surprised that I couldn't find something like this already explained somewhere. Perhaps finding values like I need to do is unusual? Or this is well-known, but under some name I couldn't find? Or is it broken in some way I can't see?

Triangular Fibonacci numbers

Saturday 17 June 2017

Yesterday in my post about 55, I repeated Wikipedia's claim that 55 is the largest number that is both triangular and in the Fibonacci sequence. Chris Emerson commented to ask for a proof. After a moment's thought, I realized I had no idea how to prove it.

The proof is in On Triangular Fibonacci Numbers, a dense 10-page excursion into number theory I don't understand.

While I couldn't follow the proof, I can partially test the claim empirically, which leads to fun with Python and itertools, something which is much more in my wheelhouse.

I started by defining generators for triangular numbers and Fibonacci numbers:

import itertools

def tri():
    """Generate an infinite sequence of triangular numbers."""
    n = 0
    for i in itertools.count(start=1):
        n += i
        yield n
        
print(list(itertools.islice(tri(), 50)))
[1, 3, 6, 10, 15, 21, 28, 36, 45, 55, 66, 78, 91, 105, 120, 136, 153, 171,
 190, 210, 231, 253, 276, 300, 325, 351, 378, 406, 435, 465, 496, 528, 561,
 595, 630, 666, 703, 741, 780, 820, 861, 903, 946, 990, 1035, 1081, 1128,
 1176, 1225, 1275]

def fib():
    """Generate an infinite sequence of Fibonacci numbers."""
    a, b = 1, 1
    while True:
        yield a
        b, a = a, a+b
        
print(list(itertools.islice(fib(), 50)))
[1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377, 610, 987, 1597, 2584,
 4181, 6765, 10946, 17711, 28657, 46368, 75025, 121393, 196418, 317811,
 514229, 832040, 1346269, 2178309, 3524578, 5702887, 9227465, 14930352,
 24157817, 39088169, 63245986, 102334155, 165580141, 267914296, 433494437,
 701408733, 1134903170, 1836311903, 2971215073, 4807526976, 7778742049,
 12586269025, 20365011074]

The Fibonacci sequence grows much faster!

My first impulse was to make two sets of the numbers in the sequences, and intersect them, but building a very large set took too long. So instead I wrote a function that took advantage of the ever-increasing nature of the sequences to look for equal elements in two monotonic sequences:

def find_same(s1, s2):
    """Find equal elements in two monotonic sequences."""
    try:
        i1, i2 = iter(s1), iter(s2)
        n1, n2 = next(i1), next(i2)
        while True:
            while n1 < n2:
                n1 = next(i1)
            if n1 == n2:
                yield n1
                n1 = next(i1)
            while n2 < n1:
                n2 = next(i2)
            if n1 == n2:
                yield n1
                n1 = next(i1)
    except StopIteration:
        return

And a function to cut off an infinite sequence once a certain ceiling had been reached:

def upto(s, n):
    """Stop a monotonic sequence once it gets to n."""
    for i in s:
        if i > n:
            return
        yield i

Now I could ask, what values less than a quadrillion are in both the triangular numbers and the Fibonacci sequence?:

>>> list(find_same(upto(fib(), 1e15), upto(tri(), 1e15)))
[1, 3, 21, 55]

This doesn't prove the claim for all numbers, but it shows that 55 is the largest number under a quadrillion that is in both sequences.

Another way to do this is to take advantage of the asymmetry in growth rate. The Fibonacci sequence up to a quadrillion has only 72 elements. Make that a set, then examine the triangular numbers up to a quadrillion and keep the ones that are in the Fibonacci set. And I'm certain there are other techniques too.
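
That sketch, reusing fib(), tri(), and upto() from above, might look like this:

>>> fib_set = set(upto(fib(), 1e15))
>>> len(fib_set)
72
>>> [t for t in upto(tri(), 1e15) if t in fib_set]
[1, 3, 21, 55]

The set is tiny, and the single pass over the roughly 45 million triangular numbers below a quadrillion needs no extra memory.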

I can't explain why, but composable generators please me. This was fun. :)

Re-ruling .rst

Friday 12 May 2017

Sometimes, you need a small job done, and you can write a small Python program, and it does just what you need, and it pleases you.

I have some Markdown files to convert to ReStructured Text. Pandoc does a really good job. But it chooses a different order for heading punctuation than our house style, and I didn't see a way to control it.

But it was easy to write a small thing to do the small thing:

import re
import sys

# The order we want our heading rules.
GOOD_RULES = '#*=-.~'

# A rule is any line of all the same non-word character, 3 or more.
RULE_RX = r"^([^\w\d])\1\1+$"

def rerule_file(f):
    rules = {}
    for line in f:
        line = line.rstrip()
        rule_m = re.search(RULE_RX, line)
        if rule_m:
            if line[0] not in rules:
                rules[line[0]] = GOOD_RULES[len(rules)]
            line = rules[line[0]] * len(line)
        print(line)

rerule_file(sys.stdin)

If you aren't conversant in .rst: there's no fixed rule for which punctuation character means which heading level. The first rule style encountered is heading 1, the next style found is heading 2, and so on.
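
For example, pasting rerule_file into a REPL and feeding it a few lines where pandoc chose = and then - (any iterable of lines works, not just a file):

>>> lines = ["Title", "=====", "", "Section", "-------"]
>>> rerule_file(lines)
Title
#####

Section
*******

The first rule character seen (=) maps to #, the second (-) to *, following the order in GOOD_RULES.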

There might be other ways to do this, but this makes me happy.
