Is it a cache?

Friday 21 March 2008

A current discussion on django-developers about identity mapping raised a sore point with me: when is something a cache? In the linked message, Philippe Raoult discusses the id/object map as a cache. In my mind, it is clear that it is not a cache.

There are two defining characteristics of a cache:

  • It makes things go faster, and
  • It doesn’t make things behave differently.

An important test of cache-ness is to imagine disabling it. How does that affect the whole system? If you answered “it goes slower”, then you have a cache. If you answered, “some stuff wouldn’t work right”, then you are talking about something besides a cache.

Of the two traits above, counter-intuitively, the second is more important. If I have a component which is transparent to the rest of the system, and was supposed to make things faster, but doesn’t, then I have a bad cache. If I have a component which speeds things up but cannot be disabled because it would break things, then I have no cache at all.
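
To make the test concrete, here is a minimal sketch of a transparent read-through lookup (hypothetical names, nothing Django-specific). Disable it and every call takes the slow path: the answers stay the same, only the timing changes.

```python
# A minimal sketch of a transparent read-through lookup (hypothetical names).
def make_lookup(slow_fetch, cache_enabled=True):
    cache = {}

    def lookup(key):
        if cache_enabled and key in cache:
            return cache[key]           # hit: fast path
        value = slow_fetch(key)         # miss, or cache disabled: slow path
        if cache_enabled:
            cache[key] = value
        return value

    return lookup
```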

Another role sometimes confused with caching is persistence. The defining characteristic of persistence is that you can give it a thing, and later it will give it back to you. That’s its promise. Caches make no such promise. In fact, there is a term for when a cache doesn’t have what you are looking for: a cache miss.

When evaluating a component, you need to keep a clear eye on what you are using it for. A tool built for caching cannot be slotted into the persistence role. The cache contract doesn’t include the promise to return data. No matter how infrequently your cache misses, it still misses. Similarly, if the component in question is important to the semantics of the system, then it isn’t a cache, because it breaks the promise to not make things behave differently.
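
Put as interfaces, the two contracts look like this - again just a sketch with made-up classes, not any particular library:

```python
from typing import Any, Optional

class Store:
    """Persistence: put(k, v), then get(k) must return v. A missing key is a bug."""
    def __init__(self):
        self._data = {}

    def put(self, key: str, value: Any) -> None:
        self._data[key] = value

    def get(self, key: str) -> Any:
        return self._data[key]          # raises KeyError if the promise is broken

class Cache:
    """Caching: get(k) may return None at any time. A miss is normal, not an error."""
    def __init__(self):
        self._data = {}

    def put(self, key: str, value: Any) -> None:
        self._data[key] = value         # the entry may be evicted at any point

    def get(self, key: str) -> Optional[Any]:
        return self._data.get(key)      # None means: go ask the real source
```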

Comments

[gravatar]
I used to argue very strongly that an identity map is not a cache at all. More recently I've started saying something along the lines of "well, it's slightly cache-like but really not" - wishy-washy language designed to make the same basic point (don't expect it to act like a cache) while avoiding needless time-consuming mailing list arguments. I made this adjustment after going back and re-reading Fowler on this, where he actually does use the term "cache" to describe the identity map in several places.
[gravatar]
I guess there are two possibilities here:

1) Fowler was wrong (!!!), or

2) He meant cache in the looser sense of "a smaller faster store that we access instead of the bigger slower store", which I guess is just a different form of Fowler being wrong.
[gravatar]
I'm afraid I disagree with your definition. Maybe it's because I deal with very different caches than you do - e.g. CPU caches, TLBs, block/page caches, etc. From my perspective, a cache is defined as a location for data that is faster to access than the authoritative location would be. Caches may or may not change behavior, and some of the ones I work with do. In fact, I'll bet you've used a processor with split instruction/data caches. This kind of cache can very much change behavior as far as the actual cache user (an OS or possibly compiler writer) is concerned even if it becomes invisible again at higher levels. Similarly, registers can be considered as caches of memory locations, and we have language constructs like C "volatile" to deal with the behavioral difference.

I agree, though, that caching is orthogonal to persistence and the two should not be confused. The distinction of whether you can miss in cache is a bit misleading, I think, because some systems are built so the main data store can miss too. I guess you could call those "cache-only" architectures (and KSR did exactly that with their Cache-Only Memory Architecture) but I think "distributed" would be a more accurate term. The fact that caches miss more often, like the fact that they might return stale data, is less a defining characteristic of caches than an emergent property of them having been built for speed at the expense of dependability.

No matter which definition you prefer, though, I think people need to follow an important rule for dealing with caches: never cache anything unless you have a strategy for how to deal with consistency issues. "Do nothing and live with stale data" is a valid strategy if it's a conscious decision, but forgetting or ignoring the possibility that your copy might get out of sync with others is a source of very many difficult-to-fix-later bugs.
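
Even something as small as an explicit time-to-live turns "live with stale data" into a conscious, bounded decision. A rough sketch (made-up class, not any particular library):

```python
import time

class TTLCache:
    """Entries are served for at most max_age seconds, then treated as misses."""
    def __init__(self, max_age: float):
        self.max_age = max_age
        self._data = {}                 # key -> (value, stored_at)

    def put(self, key, value):
        self._data[key] = (value, time.monotonic())

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None                              # miss
        value, stored_at = entry
        if time.monotonic() - stored_at > self.max_age:
            del self._data[key]                      # too stale: evict and miss
            return None
        return value
```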
[gravatar]
Jeff, for the sake of brevity, I skipped the possibility that disabling the cache would cause the system not to work right because the system depends on the timing of retrieving the data. Or maybe I'm missing your point about split instruction/data caches.

To your point about main data stores missing: I don't have any experience with them, so either you are talking about misses that are hidden from the user, in which case they can be treated as fulfilling the promise of persistence, or the user of that store needs to account for the possibility of a miss. I would count this as a specialized kind of persistence, and in any case it doesn't affect the definition of a cache; it just means that other things besides caches can have misses.

And your last paragraph is important, and gets at my central point: when dealing with a cache, misses are always a possibility. If you treat the cache as if it were persistence ("it will always return my data"), then you aren't doing it right.
[gravatar]
A split I/D cache is a case where the behavior changes in such a way that some might claim it doesn't work right with the cache. For example, on some architectures - including the one I work on - self-modifying code can go into the D-cache but then won't be fetched back through the I-cache unless extra steps are taken. I recently had to find and fix a bug like that involving a distributed filesystem that wasn't flushing the D-cache properly on reads, which showed up as programs run from that filesystem occasionally getting "illegal instruction" errors even though the subsequently-inspected instruction looked fine. More to the point, adding a cache in this case can change behavior - something your second rule would preclude, but I'm not going to walk down the hall and tell a CPU designer that what he built is not a cache. I think your "will it break if the cache is removed" test still kind of works, though, unless you consider the possibility that the main data store is defective and the cache is hiding or correcting for the defect. In such cases - and the timing issue you raise is one example - you might have something that is both a cache and a corrective. I don't see any reason it has to be exclusively one or the other.

If you treat the cache as if it were persistence ("it will always return my data"), then you aren't doing it right.

Absolutely. I would also say that if you treat any other node in a distributed system as if it provided persistence then you're also not doing it right, but that's a whole 'nother conversation. ;)
[gravatar]
My definition of cache matches yours. The first characteristic determines whether you really need a cache or not -- is it faster with a cache or not? The second determines whether you got the caching code right. Stale or invalid cache entries can introduce subtle bugs. For example, an entry gets added before a transaction commits, the transaction rolls back but the cache entry isn't removed.
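
A sketch of that failure mode, with a made-up db handle and a plain dict as the cache (not any real ORM API):

```python
cache = {}

def create_user(db, user):
    db.begin()
    try:
        db.insert("users", user)
        cache[user.id] = user      # bug: cached before the commit succeeds
        db.commit()
    except Exception:
        db.rollback()              # the row is gone from the database...
        raise                      # ...but cache[user.id] still claims it exists

# Safer: populate the cache only after commit() returns, or invalidate on rollback.
```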
[gravatar]
I have to disagree with your definition of cache based on two points:

1. Caches aren't always implemented for the purpose of making things go faster. There are other resources, such as bandwidth, that caches also conserve. While a side effect is that caches often make things faster because the conserved resource is usually some sort of bottleneck, that isn't always the purpose of the cache. Cost savings and reliability can be reasons to implement a cache as well.

2. Your defining characteristics, if we accept them as they are, are too broad to serve alone as a definition. What those characteristics really define is a better algorithm.

What you're missing is that a cache is a trade-off by definition. In computer science when we talk about cache we typically talk about trading space for locality. Hence, a cache in software might keep some data in memory so it doesn't have to access mass storage devices or the network, but it comes at the cost of increased memory use.
[gravatar]
Cost savings and reliability can be reasons to implement a cache as well.
Can you think of a real-world example? In general, the reliability of a system goes down when you add components. The exception is systems which are specifically designed for high availability and which often rely heavily on replication. Replication isn't the same as caching, though. In fact, I think that was sort of Ned's point.

As far as cost, I think a better case can be made. The cache itself costs resources and thus money, of course, but a fatter pipe (whether it's between components within a system or between systems across the world) might cost even more and that's often the tradeoff people must make to meet a specific performance goal. OTOH, I don't know if I'd say that's often the reason people implement caches. More often, cost is a constant and performance is a variable (vs. the other way around) and caching is the only way to increase performance within the same cost constraint.
[gravatar]
Can you think of a real-world example? In general, the reliability of a system goes down when you add components.

That only applies when the reliable part is within your control. Take DNS, for example. Have you ever had your DNS go down, but still been able to access a website you've recently gone to? That's not a case of replication, but an instance of your browser or operating system caching IP addresses.

Another case is with streaming video, which has two methods of increasing reliability. The first is to drop packets and buffer to ensure that the stream is uninterrupted, and the second is to keep the entirety of the video data while you're playing it so that if you skip backwards to rewatch something it doesn't have to buffer again. This wouldn't be needed if the internet was reliable.


I think that you'd be surprised just how often caching is implemented to save money. When it comes to resources like databases or servers, there are two main ways to increase capacity to handle requests: replication and caching. While replication is more reliable than caching and offers better performance, it costs a hell of a lot more. Mind you, I'm not talking hardware costs. In the world of servers, heat costs far more than the hardware itself, so getting caching servers with 64GB of RAM and no persistent storage costs a lot less than replicating a server since disks cost more in terms of power than RAM.

Another, oft-overlooked area where cost is saved by implementing a cache is in the browser, especially on mobile devices. While in many places flat rates for bandwidth are becoming more common, it's still the case that people often get charged by the KB for data transfer. If images weren't cached by browsers the cost would be much greater than it is now.
[gravatar]
Can you think of a real-world example? In general, the reliability of a system goes down when you add components.


Not really. For example: three CPUs and a voting system can be very much more reliable than a single CPU - triple redundancy.

- Paddy.
[gravatar]
As I said, Paddy, the exception is systems that are specifically designed for high availability. Back when I was working on one of the early UNIX high-availability systems, I even defined "high availability" as "a system in which reliability increases with component count instead of decreasing" in classes. It's a whole different design methodology, and one that is far less common than the "built for speed and who cares if it falls over" approach that dominates the market.

Brendan: good examples, but I still think such cases are a minority. Yes, caching can sometimes help to mask or recover from failures, but I could match every example of that with an actual bug I've had to fix where a cache was the cause of the failure. I also think that using endpoint/client caches as examples is a bit suspect. Having something in my DNS or browser cache might protect me from a failure, but it doesn't make the system - i.e. the collection of DNS or web servers - more reliable. It doesn't help anybody else; they can't come and get the data from me when the authoritative servers are down. That cache can also turn out to be harmful if a host moves or content changes. Have you never waited for a DNS update to propagate, or had to hit your browser's "refresh" button?

The actual systems in your examples achieve increasing reliability by adding components with replication and separation of concerns (e.g. delegating responsibility for a subdomain), independently of whatever caching might also be involved. Copying data from one location to another can serve both caching and replication purposes, but that doesn't mean the two purposes are the same. It's usually easy to make a replica serve dual duty as a cache, which might lead the naive to misperceive it as only or primarily a cache, but this non-equivalence can become painfully apparent when you try to treat a "pure cache" - i.e. one added for its own sake and not as a convenient extension of preexisting replication - as a replica. Caching in and of itself, distinct from replication, has a net negative effect on reliability.
[gravatar]
I disagree on both counts. Who says a cache has to be faster? It's just a first place to look. If I can retrieve the information I am interested in, without alerting the authorities, (well, not that I would be ever in such a position) that might be a good thing, no? Regardless of how long it takes, you know?

As far as not making things behave differently, is that not a function of the consumer of the cache? If the cache does not behave differently on a miss vs. a hit, I got problems, eh?
[gravatar]
So how would you classify Akamai Networks' Edge Computing service? It's like Akamai CDNing, but rather than placing a big file server strategically in an eyeball network, it places a big application server there, with (presumably) some fancy logic behind the scenes to keep them all coherent. The first DNS record in the local DNS points at the Akamai box, but the second one points at your box back at HQ to permit failover.

Now, it's not a cache in the sense that rather than just stockpiling data, it actually carries out program logic; but it is there to make everything work faster (and more reliably), and if it's been correctly implemented, taking away the Akamai box shouldn't make things go weird, just slower.
