Sunday 9 July 2017

For a 2D geometry project I
needed to find things based on 2D points. Conceptually, I wanted to have a
dict that used pairs of floats as the keys. This would work, except that
floats are inexact, and so have difficulty with equality checking. The
“same” point might be computed in two different ways, giving
*slightly* different values, but I want them to match each other in
this dictionary.

I found a solution, though I’m surprised I didn’t find it described elsewhere.

The challenges have nothing to do with the two-dimensional aspect, and everything to do with using floats, so for the purposes of explaining, I’ll simplify the problem to one-dimensional points. I’ll have a class with a single float in it to represent my points.

First let’s look at what happens with a simple dict:

`>>> from collections import namedtuple`

>>> Pt = namedtuple("Pt", "x")

>>>

>>> d = {}

>>> d[Pt(1.0)] = "hello"

>>> d[Pt(1.0)]

'hello'

>>> d[Pt(1.000000000001)]

Traceback (most recent call last):

File "<stdin>", line 1, in <module>

KeyError: Pt(x=1.000000000001)

As long as our floats are precisely equal, the dict works fine. But as soon as we get two floats that are meant to represent the same value, but are actually slightly unequal, then the dict is useless to us. So a simple dict is no good.

To get the program running at all, I used a dead-simple structure that I knew would work: a list of key/value pairs. By defining __eq__ on my Pt class to compare with some fuzz, I could find matches that were slightly unequal:

`>>> class Pt(namedtuple("Pt", "x")):`

... def __eq__(self, other):

... return math.isclose(self.x, other.x)

...

>>> def get_match(pairs, pt):

... for pt2, val in pairs:

... if pt2 == pt:

... return val

... return None

...

>>> pairs = [

... (Pt(1.0), "hello"),

... (Pt(2.0), "goodbye"),

... ]

>>>

>>> get_match(pairs, Pt(2.0))

'goodbye'

>>> get_match(pairs, Pt(2.000000000001))

'goodbye'

This works, because now we are using an inexact closeness test to find the match. But we have an O(n) algorithm, which isn’t great. Also, there’s no way to define __hash__ to match that __eq__ test, so our points are no longer hashable.

Trying to make things near each other be equal naturally brings rounding to mind. Maybe that could work? Let’s define __eq__ and __hash__ based on rounding the value:

`>>> class Pt(namedtuple("Pt", "x")):`

... def __eq__(self, other):

... return round(self.x, 6) == round(other.x, 6)

... def __hash__(self):

... return hash(round(self.x, 6))

...

>>> d = {}

>>> d[Pt(1.0)] = "apple"

>>> d[Pt(1.0)]

'apple'

>>> d[Pt(1.00000001)]

'apple'

Nice! We have matches based on values being close enough, and we can use a dict to get O(1) behavior. But:

`>>> d[Pt(1.000000499999)] = "cat"`

>>> d[Pt(1.000000500001)]

Traceback (most recent call last):

File "<stdin>", line 1, in <module>

KeyError: Pt(x=1.000000500001)

Because rounding has sharp edges (ironically!), there will always be pairs of numbers arbitrarily close together that will round away from each other.

So rounding was close, but not good enough. We can’t guarantee that we won’t have these pairs of numbers that are in just the wrong place so that rounding will give the wrong answer.

I thought about other data structures: bisection in a sorted list sounded good (O(log n)), but that requires testing for less-than, and I wasn’t sure how the fuzziness would affect it. Reading about other geometric algorithms lead to things like k-d trees, which seem exotic and powerful, but I didn’t see how to put them to use here.

Then early one morning when I wasn’t even trying to solve the problem, I had an idea: round the values two different ways. The problem with rounding was that it failed for values that straddled the rounding boundary. We can round twice, the second time with half the rounding window added, so the two are half out of phase. That way, if the first rounding separates the values, the second rounding won’t.

We can’t use double-rounding to produce a canonical value for comparisons and hashes, so instead we make a new data structure. We’ll use a dict to map points to values, just as we did originally. A second dict will map rounded points to the original points. When we look up a point, if it isn’t in the first dict, then round it two ways, and see if either is in the second dict. If it is, then we know what original point to use in the first dict.

That’s a lot of words. Here’s some code:

`class Pt(namedtuple("Pt", "x")):`

# no need for special __eq__ or __hash__

pass

class PointMap:

def __init__(self):

self._items = {}

self._rounded = {}

ROUND_DIGITS = 6

JITTERS = [0, 0.5 * 10 ** -ROUND_DIGITS]

def _round(self, pt, jitter):

return Pt(round(pt.x + jitter, ndigits=self.ROUND_DIGITS))

def __getitem__(self, pt):

val = self._items.get(pt)

if val is not None:

return val

for jitter in self.JITTERS:

pt_rnd = self._round(pt, jitter)

pt0 = self._rounded.get(pt_rnd)

if pt0 is not None:

return self._items[pt0]

raise KeyError(pt)

def __setitem__(self, pt, val):

self._items[pt] = val

for jitter in self.JITTERS:

pt_rnd = self._round(pt, jitter)

self._rounded[pt_rnd] = pt

This is now O(1), wonderful! I was surprised that I couldn’t find something like this already explained somewhere. Perhaps finding values like I need to do is unusual? Or this is well-known, but under some name I couldn’t find? Or is it broken in some way I can’t see?

## Comments

Allan8:50 AM on 10 Jul 2017Could your class use a single dict, instead of two, if you use the string-representation of the the rounded float for keys? Perhaps the key could be something like:

```

In [20]: p = 1.000001234

In [21]: fmt = '{:6f}{:6f}'

In [22]: mf = 10**6

In [23]: fmt.format(math.ceil(p*mf)/mf, math.floor(p*mf)/mf)

Out[23]: '1.0000021.000001'

```

Guess the problem is very much like a binning-problem, when you construct a histogram in e.g. Numpy... which might be a good source to look for an algorithm for solving the "is this float close enough to that one"-problem?

`mf` dictates the precision with which the float should be stored... or the bin width, if you will.

The `ceil` and the `floor` chooses the "bin boundaries".

While there goes processing time into constructing the key, there's only a single lookup, and less Python code (assuming the string-construction happens in C, which I don't know whether is true or not).

What's your thoughts on this?

Note that I haven't testet the code as thoroughly as your code.

Oddthinking9:41 AM on 10 Jul 2017Two quick typos:

Both of these should be referring toOddthinking9:51 AM on 10 Jul 2017Consider:

As expected, this fails with a KeyError because the value being looked up is too far away from the first - it is double the epsilon/jitter value.But now replace the 0.0999999999999 with 0.1 and run it again. It should behave the same way - not only are the initial values very, very close, but the value being looked up is still the same distance away from the value being put in the collection.

However, now it decides the two numbers

areclose enough together to be a match.This inconsistency may or may not be acceptable to you.

Ned Batchelder2:05 PM on 10 Jul 2017@Oddthinking: thanks for the typo-spotting. I've fixed them in the post.

Thanks for the detailed analysis. It's true that the sensitivity of the grid depends on where exactly on the grid the point falls. Two points up to 2.999999*epsilon could be paired together if they are at just the right place.

Ned Batchelder2:07 PM on 10 Jul 2017Oh, and I'm fine with that variation. The epsilon I'm using in the real program is much much smaller than the expected distance between truly distinct points.

Ned Batchelder2:17 PM on 10 Jul 2017@Allan: I don't go into it in this blog post, but I wanted this object to act as much as possible like a dict with points as keys. That includes being able to iterate it to see all the points. So I wanted to keep the points as the keys in _items. This is also why there are two dicts: so that dunder methods like __iter__, __contains__, __len__, etc, can be implemented on top of _items, and _rounded is there just to support the fuzzy finding.

So I haven't thought about other tricks I could do with clever keys.

Nethanel Elzas7:33 PM on 10 Jul 2017I'm not sure if the decimal module (https://docs.python.org/3.4/library/decimal.html) would solve your exact problem, but I thought it can be useful for others problems with floats.

Ned Batchelder8:05 PM on 10 Jul 2017@Nethanel: Decimals don't help with irrational numbers, or with computations that should result in the same value, but have slightly different results because they arrive at the value in different ways.

Emile Snyder10:40 PM on 20 Jul 2017It seems like it doesn't matter for your case, but you are deliberately defining equal in such a way that you can have 3 keys, A, B and C, where A == B, and B == C, but A != C, right?

Ned Batchelder1:42 AM on 21 Jul 2017@Emile, yes, the definition of __eq__ is odd: it doesn't define equivalence classes, which is why there's no corresponding definition of __hash__, and therefore why we can't just use these as keys in a dict.

## Add a comment: