Ordered dict surprises

Monday 12 October 2020

Since Python 3.6, regular dictionaries retain their insertion order: when you iterate over a dict, you get the items in the same order they were added to the dict. Before 3.6, dicts were unordered: the iteration order was seemingly random.

Here are two surprising things about these ordered dicts.

You can’t get the first item

Since the items in a dict have a specific order, it should be easy to get the first (or Nth) item, right? Wrong. It’s not possible to do this directly. You might think that d[0] would be the first item, but it’s not, it’s the value of the key 0, which could be the last item added to the dict.

The only way to get the Nth item is to iterate over the dict, and wait until you get to the Nth item. There’s no random access by ordered index. This is one place where lists are better than dicts. Getting the Nth element of a list is an O(1) operation. Getting the Nth element of a dict (even if it is ordered) is an O(N) operation.

OrderedDict is a little different

If dicts are ordered now, collections.OrderedDict is useless, right? Well, maybe. It won’t be removed because that would break code using that class, and it has some methods that regular dicts don’t. But there’s also one subtle difference in behavior. Regular dicts don’t take order into account when comparing dicts for equality, but OrderedDicts do:

>>> d1 = {"a": 1, "b": 2}
>>> d2 = {"b": 2, "a": 1}
>>> d1 == d2
True
>>> list(d1)
['a', 'b']
>>> list(d2)
['b', 'a']

>>> from collections import OrderedDict
>>> od1 = OrderedDict([("a", 1), ("b", 2)])
>>> od2 = OrderedDict([("b", 2), ("a", 1)])
>>> od1 == od2
False
>>> list(od1)
['a', 'b']
>>> list(od2)
['b', 'a']
>>>

BTW, this post is the result of a surprisingly long and contentious discussion in the Python Discord.

Comments

[gravatar]
This seems to get back the first item in a dictionary reasonably efficiently, am I missing something here?
first_item = next(iter(d))
[gravatar]
Yes, that works, but it counts as "iterate over the dict, and wait until you get to the Nth item." Since it's the first item, you don't have to wait long! :)
[gravatar]
Ah gotcha - I was thinking O(N) in terms of size of the dictionary, not how-far-along-you-need-to-go - so I was thinking that accessing the first item was O(1) and any item after that was O(N).
[gravatar]
The second gotcha is the source of even more surprising and unfortunate behavior, namely that OrderedDict will give you non-transitive equality :/...

Specifically:

OrderedDict([("a", 1), ("b", 2)]) == {"a": 1, "b": 2} == OrderedDict([("b", 2), ("a", 1)])

but OrderedDict([("a", 1), ("b", 2)]) != OrderedDict([("b", 2), ("a", 1)])
[gravatar]
It's not that difficult, actually.
d[tuple(d.keys())[0])
[gravatar]
@Mikhail It's a simple line of code, but it's iterating the dict, as I mentioned. If your dict has 1000 elements, you will create a 1000-element tuple and then take the first element.
[gravatar]
Are you sure that converting dict.keys to tuple involves iteration? Please demonstrate.
[gravatar]
According to the docs dict.keys() returns a PyListObject, so converting it to tuple doesn't involve iteration.
[gravatar]
Yes, it involves iteration. Try it: you get a 1000-element tuple. That means it must be an O(N) operation.
[gravatar]
And to answer your 1000 keys question here's what I ran on my PHONE. The result was 0.0466268063 seconds for a.dict with 1 MILLION keys
import time

d = {f"key_{i}": f"value_{i}" for i in range(1000000)}

t1 = time.time()
got_value = d[tuple(d.keys())[500]]
t2 = time.time()

exec_time = t2 - t1
print(f"It took {exec_time:.10f} seconds for a.dict with {len(d)} keys")

>>>It took 0.0466268063 seconds for a.dict with 1000000 keys

[Program finished]
[gravatar]
Well, it involves as much iteration, as just looking up a value by key ?
[gravatar]
Just admit you didn't think of my way ?
[gravatar]
Ooh, I found a better way, stupid me :)

list(d)[500]
[gravatar]
@Mikhail: please stop. All of these involve iterating over the dictionary. Looking up a value by key is an O(1) operation. Your ways are all O(N). They may be very fast, but they are still O(N).
[gravatar]
Ok, according to your own logic, I quote the second comment from the top written by you:
"Yes, that works, but it counts as "iterate over the dict, and wait until you get to the Nth item." Since it's the first item, you don't have to wait long! :) "
So, let's check that. According to you getting the 999,999th item in the dictionary should take at least a little longer than getting the 0th item in the dictionary.

So I checked that. And to make sure, that I checked it right I did 100 runs and aggregated the results. Here they are. Left column is the index of the value that we're getting. Right column: mean of the time it took to get it.
          
0             0.018350
1000          0.018611
10000         0.018211
100000        0.018274
500000        0.018855
999999        0.018140
As you can see, there is no difference in retrieval time. And for once it would be nice to hear some argumentation from your side, apart from "oh it's iterating" and "please stop". Here's my code used for testing:
from time import time
import pandas as pd

dict_size = 1_000_000
experimental_dict = {f"key_{i}": f"value_{i}" for i in range(dict_size)}

check_index = [0, 1_000, 10_000, 100_000, 500_000, 999_999]
result = []
for attempt in range(100):
    for i in check_index:
        t1 = time()
        value = experimental_dict[tuple(experimental_dict.keys())[i]]
        t2 = time()
        result.append(
            {
                "i": i,
                "execution_time": t2 - t1
            }
        )

df = pd.DataFrame(result).groupby(by="i").mean()
By the way, just for the laughs, I did the same test on a dictionary with TEN MILLION records. That's a lot of records for a simple dict, but who cares. And guess what: it took exactly the same amount of time to retrieve the 10-millionth record as it took to retrieve the 0th.
0              0.183136
1000           0.181712
10000          0.182742
100000         0.182070
500000         0.182609
999999         0.182580
9999999        0.182191
If in some way I'm wrong, and I just fail to see the iteration and growing execution time -- by all means enlighten me, and I will admit I was wrong. Otherwise be a gentleman ;)
[gravatar]
In this case, it's O(N) where N isn't the index you are looking for, but the size of the dict you are looking at.
import time

for exp in range(3, 8):
    num = 10 ** exp
    d = {f"key_{i}": f"value_{i}" for i in range(num)}

    t1 = time.time()
    got_value = d[tuple(d.keys())[500]]
    t2 = time.time()

    exec_time = t2 - t1
    print(f"It took {exec_time:.10f} seconds for a.dict with {len(d)} keys")
This produces:
It took 0.0000109673 seconds for a.dict with 1000 keys
It took 0.0002467632 seconds for a.dict with 10000 keys
It took 0.0029292107 seconds for a.dict with 100000 keys
It took 0.0338559151 seconds for a.dict with 1000000 keys
It took 0.4842569828 seconds for a.dict with 10000000 keys
Each time the dict grew ten times larger, and the time took ten times longer. O(N).
[gravatar]
The reason it's O(N) in the size of the dict is that you are iterating over the entire dict in order to build your list or tuple.

Dicts don't offer numeric indexing, you have to iterate them to get it. The N could either mean the index you are looking for, if you iterate until you find what you want, or it could mean the size of the dict, if you iterate the entire dict as you are doing.

Your experiment demonstrates that once you have a list or tuple, accessing the element by index is a constant-time operation regardless of the index. This is the exact point being made in the post: lists are better than dicts at random access by numeric index. Your timing experiment shows that. But you've paid the price of iterating the entire dict first, and that total iteration is O(N) in the size of the dict.
[gravatar]
Thanks for a proper response, I appreciate that. I'm convinced, you're right. The time it takes to convert dict to list increases the longer the dictionary is.

Even if there is iteration, though, it's handled on the level of Python's C engine, so in practicality it won't matter. Especially since you can cache the result of
tuple(dict.keys())
if you need to acces items by index.

But I agree that you're better off with a list for that.
[gravatar]
Alexander Loftus 3:57 PM on 15 Oct 2020
Mikhael, just for the record, my reaction to reading these posts is that you were being fairly obnoxious.
[gravatar]
Thanks for sharing, Alexander
[gravatar]
It's worth considering why accessing the first/last element is interesting. Sometimes, it's because we've expected/proven the collection to only have one element, and we want that element uniquely. Other times, it's because we're performing some iteration, and we want the first/last instance of the iteration to have special behavior.
[gravatar]
Mikhael, just for the record, my reaction to reading these posts is that you were keen, involving, and gracious when Ned could finally convince you. Your code showed that you had worked on your point rather than just mouthing off. Ned took the time to enlighten you. And I had a lesson in good debates.
[gravatar]
Thank you Paddy, it's very kind of you!

Add a comment:

Ignore this:
Leave this empty:
Name is required. Either email or web are required. Email won't be displayed and I won't spam you. Your web site won't be indexed by search engines.
Don't put anything here:
Leave this empty:
Comment text is Markdown.