Sunday 9 July 2017
For a 2D geometry project I needed to find things based on 2D points. Conceptually, I wanted to have a dict that used pairs of floats as the keys. This would work, except that floats are inexact, and so have difficulty with equality checking. The “same” point might be computed in two different ways, giving slightly different values, but I want them to match each other in this dictionary.
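A quick illustration of the kind of mismatch I mean: two computations that should produce the same value can differ in the last few bits.

```
>>> 0.1 + 0.2 == 0.3
False
>>> 0.1 + 0.2
0.30000000000000004
```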
I found a solution, though I’m surprised I didn’t find it described elsewhere.
The challenges have nothing to do with the two-dimensional aspect, and everything to do with using floats, so for the purposes of explaining, I’ll simplify the problem to one-dimensional points. I’ll have a class with a single float in it to represent my points.
First let’s look at what happens with a simple dict:
```
>>> from collections import namedtuple
>>> Pt = namedtuple("Pt", "x")
>>>
>>> d = {}
>>> d[Pt(1.0)] = "hello"
>>> d[Pt(1.0)]
'hello'
>>> d[Pt(1.000000000001)]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: Pt(x=1.000000000001)
```
As long as our floats are precisely equal, the dict works fine. But as soon as we get two floats that are meant to represent the same value, but are actually slightly unequal, then the dict is useless to us. So a simple dict is no good.
To get the program running at all, I used a dead-simple structure that I knew would work: a list of key/value pairs. By defining __eq__ on my Pt class to compare with some fuzz, I could find matches that were slightly unequal:
```
>>> import math
>>> class Pt(namedtuple("Pt", "x")):
...     def __eq__(self, other):
...         return math.isclose(self.x, other.x)
...
>>> def get_match(pairs, pt):
...     for pt2, val in pairs:
...         if pt2 == pt:
...             return val
...     return None
...
>>> pairs = [
...     (Pt(1.0), "hello"),
...     (Pt(2.0), "goodbye"),
... ]
>>>
>>> get_match(pairs, Pt(2.0))
'goodbye'
>>> get_match(pairs, Pt(2.000000000001))
'goodbye'
```
This works, because now we are using an inexact closeness test to find the match. But we have an O(n) algorithm, which isn’t great. Also, there’s no way to define __hash__ to match that __eq__ test, so our points are no longer hashable.
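In fact, defining __eq__ without also defining __hash__ makes Python set __hash__ to None, so this fuzzy Pt can't be used as a dict key at all:

```
>>> {Pt(1.0): "hello"}
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'Pt'
```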
Trying to make things near each other be equal naturally brings rounding to mind. Maybe that could work? Let’s define __eq__ and __hash__ based on rounding the value:
```
>>> class Pt(namedtuple("Pt", "x")):
...     def __eq__(self, other):
...         return round(self.x, 6) == round(other.x, 6)
...     def __hash__(self):
...         return hash(round(self.x, 6))
...
>>> d = {}
>>> d[Pt(1.0)] = "apple"
>>> d[Pt(1.0)]
'apple'
>>> d[Pt(1.00000001)]
'apple'
```
Nice! We have matches based on values being close enough, and we can use a dict to get O(1) behavior. But:
```
>>> d[Pt(1.000000499999)] = "cat"
>>> d[Pt(1.000000500001)]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: Pt(x=1.000000500001)
```
Because rounding has sharp edges (ironically!), there will always be pairs of numbers arbitrarily close together that will round away from each other.
So rounding was close, but not good enough. We can’t guarantee that we won’t have these pairs of numbers that are in just the wrong place so that rounding will give the wrong answer.
I thought about other data structures: bisection in a sorted list sounded good (O(log n)), but that requires testing for less-than, and I wasn't sure how the fuzziness would affect it. Reading about other geometric algorithms led to things like k-d trees, which seem exotic and powerful, but I didn't see how to put them to use here.
Then early one morning when I wasn’t even trying to solve the problem, I had an idea: round the values two different ways. The problem with rounding was that it failed for values that straddled the rounding boundary. To avoid that problem we can round twice, the second time with half the rounding window added, so the two are half out of phase. That way if the first rounding separates the values, the second rounding won’t.
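Here is the idea applied to the pair of values that broke the rounding dict: plain rounding separates them, but rounding with the half-window shift keeps them together.

```
>>> a, b = 1.000000499999, 1.000000500001
>>> round(a, 6), round(b, 6)
(1.0, 1.000001)
>>> round(a + 0.0000005, 6), round(b + 0.0000005, 6)
(1.000001, 1.000001)
```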
We can’t use double-rounding to produce a canonical value for comparisons and hashes, so instead we make a new data structure. We’ll use a dict to map points to values, just as we did originally. A second dict will map rounded points to the original points. When we look up a point, if it isn’t in the first dict, then round it two ways, and see if either is in the second dict. If it is, then we know what original point to use in the first dict.
That’s a lot of words. Here’s some code:
```
from collections import namedtuple


class Pt(namedtuple("Pt", "x")):
    # no need for special __eq__ or __hash__
    pass


class PointMap:
    def __init__(self):
        self._items = {}        # exact Pt -> value
        self._rounded = {}      # rounded Pt -> exact Pt key in _items

    ROUND_DIGITS = 6
    JITTERS = [0, 0.5 * 10 ** -ROUND_DIGITS]

    def _round(self, pt, jitter):
        return Pt(round(pt.x + jitter, ndigits=self.ROUND_DIGITS))

    def __getitem__(self, pt):
        val = self._items.get(pt)
        if val is not None:
            return val
        for jitter in self.JITTERS:
            pt_rnd = self._round(pt, jitter)
            pt0 = self._rounded.get(pt_rnd)
            if pt0 is not None:
                return self._items[pt0]
        raise KeyError(pt)

    def __setitem__(self, pt, val):
        self._items[pt] = val
        for jitter in self.JITTERS:
            pt_rnd = self._round(pt, jitter)
            self._rounded[pt_rnd] = pt
```
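A quick check shows that the pair of values that defeated plain rounding now finds a match:

```
>>> pm = PointMap()
>>> pm[Pt(1.000000499999)] = "cat"
>>> pm[Pt(1.000000500001)]
'cat'
```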
This is now O(1), wonderful! I was surprised that I couldn't find something like this already explained somewhere. Perhaps the kind of fuzzy lookup I need is unusual? Or this is well-known, but under some name I couldn't find? Or is it broken in some way I can't see?
Comments
```
In [20]: p = 1.000001234
In [21]: fmt = '{:6f}{:6f}'
In [22]: mf = 10**6
In [23]: fmt.format(math.ceil(p*mf)/mf, math.floor(p*mf)/mf)
Out[23]: '1.0000021.000001'
```
I guess the problem is very much like a binning problem, like when you construct a histogram in e.g. NumPy... which might be a good source to look for an algorithm for solving the "is this float close enough to that one" problem.
`mf` dictates the precision with which the float should be stored... or the bin width, if you will.
The `ceil` and the `floor` choose the "bin boundaries".
While it takes some processing time to construct the key, there's only a single lookup, and less Python code (assuming the string construction happens in C, which I don't know whether is true or not).
What are your thoughts on this?
Note that I haven't tested the code as thoroughly as your code.
But now replace the 0.0999999999999 with 0.1 and run it again. It should behave the same way - not only are the initial values very, very close, but the value being looked up is still the same distance away from the value being put in the collection.
However, now it decides the two numbers are close enough together to be a match.
This inconsistency may or may not be acceptable to you.
Thanks for the detailed analysis. It's true that the sensitivity of the grid depends on where exactly on the grid the point falls. Two points up to 2.999999*epsilon could be paired together if they are at just the right place.
So I haven't thought about other tricks I could do with clever keys.