Friday 21 March 2008 — This is close to 17 years old. Be careful.
A current discussion on django-developers about identity mapping raised a sore point with me: when is something a cache? In the linked message, Philippe Raoult discussing the id/object map as a cache. In my mind, it is clear that it is not a cache.
There are two defining characteristics of a cache:
- It makes things go faster, and
- It doesn’t make things behave differently.
An important test of cache-ness is to imagine disabling it. How does that affect the whole system? If you answered “it goes slower”, then you have a cache. If you answered, “some stuff wouldn’t work right”, then you are talking about something besides a cache.
Of the two traits above, counter-intuitively, the second is more important. If I have a component which is transparent to the rest of the system, and was supposed to make things faster, but doesn’t, then I have a bad cache. If I have a component which speeds things up but cannot be disabled because it would break things, then I have no cache at all.
Another role sometimes confused with caching is persistence. The defining characteristic of persistence is that you can give it a thing, and later it will give it back to you. That’s its promise. Caches make no such promise. In fact, there is a term for when a cache doesn’t have what you are looking for: a cache miss.
When evaluating a component, you need to keep a clear eye on what you are using it for. A tool built for caching cannot be slotted into the persistence role. The cache contract doesn’t include the promise to return data. No matter how infrequently your cache misses, it still misses. Similarly, if the component in question is important to the semantics of the system, then it isn’t a cache, because it breaks the promise to not make things behave differently.
Comments
1) Fowler was wrong (!!!), or
2) He meant cache in the looser sense of "a smaller faster store that we access instead of the bigger slower store", which I guess is just a different form of Fowler being wrong.
I agree, though, that caching is orthogonal to persistence and the two should not be confused. The distinction of whether you can miss in cache is a bit misleading, I think, because some systems are built so the main data store can miss too. I guess you could call those "cache-only" architectures (and KSR did exactly that with their Cache-Only Memory Architecture) but I think "distributed" would be a more accurate term. The fact that caches miss more often, like the fact that they might return stale data, is less a defining characteristic of caches than an emergent property of them having been built for speed at the expense of dependability.
No matter which definition you prefer, though, I think people need to follow an important rule for dealing with caches: never cache anything unless you have a strategy for how to deal with consistency issues. "Do nothing and live with stale data" is a valid strategy if it's a conscious decision, but forgetting or ignoring the possibility that your copy might get out of sync with others is a source of very many difficult-to-fix-later bugs.
To your point about main data stores missing: I don't have any experience with them, so either you are talking about misses that are hidden from the user, so they can be treated as fulfilling the promise of persistence, or the user of that store needs to account for the possibility of the miss. I would count this as a specialized kind of persistence, and in any case doesn't affect the definition of cache, it just means that other things besides caches can have misses.
And you last paragraph is important, and gets at my central point: when dealing with a cache, misses are always a possibility. If you treat the cache as if it were persistence ("it will always return my data"), then you aren't doing it right.
If you treat the cache as if it were persistence ("it will always return my data"), then you aren't doing it right.
Absolutely. I would also say that if you treat any other node in a distributed system as if it provided persistence then you're also not doing it right, but that's a whole 'nother conversation. ;)
1. Caches aren't always implemented for the purpose of making things go faster. There are other resources, such as bandwidth, caches also conserve. While a side effect is that caches often make things faster because the conserved resource is usually some sort of bottleneck, that isn't always the purpose of the cache. Cost savings and reliability can be reasons to implement a cache as well.
2. You're defining characteristics, if we accept them as they are, are too broad to serve alone as a definition. What those characteristics really define is a better algorithm.
What you're missing is that cache is a trade-off by definition. In computer science when we talk about cache we typically talk about trading space for locality. Hence, a cache in software might keep some data in memory so it doesn't have to access mass storage devices or the network, but it comes at the cost of increased memory use.
Can you think of a real-world example? In general, the reliability of a system goes down when you add components. The exception is systems which are specifically designed for high availability and which often rely heavily on replication. Replication isn't the same as caching, though. In fact, I think that was sort of Ned's point.
As far as cost, I think a better case can be made. The cache itself costs resources and thus money, of course, but a fatter pipe (whether it's between components within a system or between systems across the world) might cost even more and that's often the tradeoff people must make to meet a specific performance goal. OTOH, I don't know if I'd say that's often the reason people implement caches. More often, cost is a constant and performance is a variable (vs. the other way around) and caching is the only way to increase performance within the same cost constraint.
That only applies when the reliable part is within your control. Take DNS, for example. Have you ever had your DNS go down, but still been able to access website you've recently gone to? That's not a case of replication, but an instance of your browser or operating system caching ip addresses.
Another case is with streaming video, which has two methods of increasing reliability. The first is to drop packets and buffer to ensure that the stream is uninterrupted, and the second is to keep the entirety of the video data while you're playing it so that if you skip backwards to rewatch something it doesn't have to buffer again. This wouldn't be needed if the internet was reliable.
I think that you'd be surprised just how often caching is implemented to save money. When it comes to resources like databases or servers, there are two main ways to increase capacity to handle requests: replication and caching. While replication is more reliable than caching and offers better performance, it costs a hell of a lot more. Mind you, I'm not talking hardware costs. In the world of servers, heat costs far more than the hardware itself, so getting caching servers with 64GB of RAM and no persistent storage costs a lot less than replicating a server since disks cost more in terms of power than RAM.
Another, oft overlooked area where cost is saved by implementing cache is in the browser, especially on mobile devices. While in many places flat rates for bandwidth are becoming more common, it's still the case that people often get charged by the KB for data transfer. If images weren't cached by browsers the cost would be much greater than it is now.
Not really. For example: three cpu's and a voting system can be very much more reliable than a single cpu - triple redundancy.
- Paddy.
Brendan: good examples, but I still think such cases are a minority. Yes, caching can sometimes help to mask or recover from failures, but I could match every example of that with an actual bug I've had to fix where a cache was the cause of the failure. I also think that using endpoint/client caches as examples is a bit suspect. Having something in my DNS or browser cache might protect me from a failure, but it doesn't make the system - i.e. the collection of DNS or web servers - more reliable. It doesn't help anybody else; they can't come and get the data from me when the authoritative servers are down. That cache can also turn out to be harmful if a host moves or content changes. Have you never waited for a DNS update to propagate, or had to hit your browser's "refresh" button?
The actual systems in your examples achieve increasing reliability by adding components with replication and separation of concerns (e.g. delegating responsibility for a subdomain), independently of whatever caching might also be involved. Copying data from one location to another can serve both caching and replication purposes, but that doesn't mean the two purposes are the same. It's usually easy to make a replica serve dual duty as a cache, which might lead the naive to misperceive it as only or primarily a cache, but this non-equivalence can become painfully apparent when you try to treat a "pure cache" - i.e. one added for its own sake and not as a convenient extension of preexisting replication - as a replica. Caching in and of itself, distinct from replication, has a net negative effect on reliability.
As far as not making things behave differently is that not a function of the consumer of the cache? If the cache does not behave differently on a miss vs. a hit, I got problems, eh?
Now, it's not a cache in the sense that rather than just stockpiling data, it actually carries out program logic; but it is there to make everything work faster (and more reliably), and if it's been correctly implemented, taking away the Akamai box shouldn't make things go weird, just slower.
Add a comment: