Joe Mosby's blog

Notes on Darwin, an adaptable caching system for CDNs

Caching is a problem that everyone encounters constantly throughout their day, but that only computer scientists and home-organizing geeks think entirely too much about. It's the idea of having certain things readily available when you need them, and keeping less-needed things stored in a less accessible place.

Consider the kitchen in my first ever apartment, which was very small:

Kitchen layout

Now, I really liked to cook, even in this tiny apartment. And I loved my kitchen gadgets. I had my Instant Pot, my wok, my sous vide, my fancy pots and pans, my cutting boards... you name it. I could put some of those gadgets on the countertops, where they'd be at arm's reach when I was cooking, but I certainly couldn't put all of them there at one time. I could put maybe three items on my countertops, and everything else went elsewhere. Let's imagine a world in which I keep everything recently used out on the countertops:

Kitchen layout, LRU example

So if I'm cooking with my roasting pan, my sauce pan, and my cutting board, I'm set. If I want to use the Instant Pot or the cast iron, though, I'm in trouble! And those are really heavy! Even if I don't use them that often, I use them enough that I don't want to be lugging them back and forth! So let's instead re-imagine my kitchen layout so that the heaviest items are always out on the countertops:

Kitchen layout, heaviest weight example

Now, let's say I want to cook something that needs just the wok, and I've currently got the cast iron, the Instant Pot, and the cutting board out on the countertop. If I'm focusing on keeping the heaviest items out and available, then I need to put the cutting board away and replace it with the wok. I might use the cutting board more frequently than the Instant Pot, but I'm not optimizing for frequency, I'm optimizing for weight.

And now we've figured out basic caching! It's the problem of deciding what goes on the countertop, and when the countertop is full, what gets put back in the cupboard to make room for more stuff.
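
To make the countertop analogy concrete, here's a tiny sketch (with made-up kitchen items) of a cache whose eviction rule is swappable: one mode puts away the least recently used gadget, the other always keeps the heaviest gadgets out.

    from collections import OrderedDict

    # A hypothetical "countertop" cache with room for three items.
    # evict_policy decides what goes back in the cupboard when we're full.
    class Countertop:
        def __init__(self, capacity=3, evict_policy="lru"):
            self.capacity = capacity
            self.evict_policy = evict_policy
            self.items = OrderedDict()  # name -> weight, in order of last use

        def use(self, name, weight):
            if name in self.items:
                self.items.move_to_end(name)   # already out: just touch it
                return
            if len(self.items) >= self.capacity:
                if self.evict_policy == "lru":
                    self.items.popitem(last=False)   # put away the least recently used
                else:  # "lightest-out": keep the heaviest items on the counter
                    lightest = min(self.items, key=self.items.get)
                    del self.items[lightest]
            self.items[name] = weight

    counter = Countertop(evict_policy="lightest-out")
    for gadget, weight in [("cast iron", 8), ("instant pot", 12),
                           ("cutting board", 2), ("wok", 4)]:
        counter.use(gadget, weight)
    print(counter.items)  # the cutting board got put away to make room for the wok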

Content Delivery Networks

Now, consider the Internet. The Internet also has metaphorical countertops, and therefore, a caching problem.

Consider the millions of people accessing the NY Times homepage every day. That page is updated regularly, but not constantly - maybe every 15 minutes. That's a long time in computers! We don't need to keep the NY Times application server continuously regenerating that page, so why not build it once and then cache it? Much faster.

Well, once we've cached it, then we get to think about sending it to the reader. This is where the physical nature of the Internet starts to play a role. If I'm connected directly to the NY Times app server, then it's pretty fast for me to download it. But I'm not! I'm based in Colorado, which is 1600 miles away from New York. If I could download at the speed of light, then it would take me 0.0087 seconds to download that content. That's pretty quick, though it'll be slowed down by hops across different networks.

If the NY Times is going to be caching this content anyway, why not cache it physically closer to me and all the other Coloradans and reduce that travel time even more? Why not cache it in, say, Wyoming? Cheyenne is only 101 miles from me in Denver, and it would take light 0.00054 seconds to travel between me and Cheyenne. That's 16x faster download times for me, just by moving the content physically closer to my location.
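
The back-of-the-envelope numbers above are just distance divided by the speed of light; a quick sketch of the arithmetic (distances approximate, network hops ignored):

    # One-way propagation delay at the speed of light, ignoring routing and hops.
    SPEED_OF_LIGHT_KM_S = 299_792
    MILES_TO_KM = 1.609

    def light_delay_s(distance_miles):
        return distance_miles * MILES_TO_KM / SPEED_OF_LIGHT_KM_S

    ny = light_delay_s(1600)       # Denver -> New York
    cheyenne = light_delay_s(101)  # Denver -> Cheyenne
    print(f"{ny:.4f} s vs {cheyenne:.5f} s ({ny / cheyenne:.0f}x closer)")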

This is what a content delivery network, or CDN, does. They operate a giant caching system consisting of caching servers around the world, all trying to make it efficient for a reader in Sydney, Australia to read the NY Times without the content having to hop across the Pacific Ocean every time. And CDNs are very concerned with their caching algorithms, just like I was with my kitchen countertops. No CDN can store the entire Internet at all times; they have to make choices. Do they try to preserve the things that are accessed most frequently? Or updated most recently? Or that would be most irritating to have to re-download? This is how they make money, and they care a lot about it.

Which is why I'm interested in this paper about Darwin, a CDN caching system that can adapt to traffic patterns to optimize its own cache. It seeks to take in different parameters to learn the best way to cache as traffic patterns change, and the authors claim a significant performance gain with it.

Introduction

CDN servers have a hierarchical structure in line with the caching principles we outlined above. The fastest cache is the Hot Object Cache (HOC), which holds everything the server can store in memory. Next is everything in the server's Disk Cache (DC). When a request comes in, the CDN server first tries to retrieve the object from the HOC, then from the DC if it isn't in the HOC, and finally via a call back to the content provider if it isn't on the server at all. The server's cache management policy - the focus of this paper - determines which objects get held versus evicted.

Cache layout
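
In rough pseudocode-flavored Python, the lookup order works like this (a sketch; hoc, disk_cache, and fetch_from_origin are hypothetical stand-ins):

    def serve(request_url, hoc, disk_cache, fetch_from_origin):
        """Return content for a request, checking the fastest tiers first."""
        # 1. Hot Object Cache: objects held in memory.
        if request_url in hoc:
            return hoc[request_url]
        # 2. Disk Cache: slower, but still local to the CDN server.
        if request_url in disk_cache:
            return disk_cache[request_url]
        # 3. Cache miss: go all the way back to the content provider.
        content = fetch_from_origin(request_url)
        # Whether the object then gets admitted into the HOC or DC is up to the
        # cache management policy - the subject of the paper.
        return content

    # Toy usage with plain dicts standing in for the two cache tiers.
    hoc = {"/index.html": "<html>cached in memory</html>"}
    disk = {"/archive/2019.html": "<html>cached on disk</html>"}
    print(serve("/index.html", hoc, disk, lambda url: "<html>from origin</html>"))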

CDN providers aren't hosting just one type of content for one content provider, though; they're hosting many. Akamai might host Le Monde, the New York Times, and MLB highlights all in one place. The Akamai nodes in Paris might generally get more requests for Le Monde, but if the Times breaks a major news story, suddenly those traffic patterns change. This makes it challenging to design a caching strategy that can deal with all these different patterns.

The authors treat this as a multi-armed bandit problem. Imagine you're in a casino, faced with a line of slot machines with an unknown probability distribution of paying out rewards. You want to maximize your chances at a payout, but in order to do this, you need a little more information. So you pull a few of the slot machines, gathering information about their payoff frequencies. Once you know enough about the machines that pay the best, you want to exploit those, hitting them again and again.

CDN operators have plenty of historical traffic data, so Darwin's first step is to cluster workloads based on a few different factors, then associate with each cluster a small set of caching policies that deliver good performance on that workload. They also train a neural net that can predict one policy's performance based on another, currently running policy. Both of these steps can be performed offline. With these in place, Darwin maps inbound traffic to its closest cluster, then uses its prediction networks to select the best policy for that cluster.

Most of the literature focuses on cache eviction policies (which objects to kick out when the cache is full) and far less on cache admission policies (whether to add an object to the cache in the first place). Nearly 70% of the unique objects accessed from a CDN cache are requested only once, so admitting them is a waste of resources. The authors choose to focus on admission policies.

Why Static Policies Don't Work Well

In production, CDN HOC admission generally relies on hand-tuned parameters, like pre-determined frequency and size thresholds. Different traffic patterns make this difficult to optimize. The authors simulate a 100MB HOC and a 10GB DC with real traces. Even selecting the best parameters for one particular traffic window causes performance on another window to degrade noticeably. The authors also note that small 20KB images end up taking up most of the HOC in all cases, booting out the larger images and content that would actually benefit from being stored in memory.

Learning the Admission Decisions

CDN servers have four excellent characteristics that make them perfect for offline learning algorithms.

  1. They produce a huge amount of logs that can be used for training data
  2. It is very apparent in the data when a cache miss occurs, given the spike in response time
  3. The best caching behaviors are deterministic for a given request sequence
  4. Cache servers in general are not compute-bound, which allows them to apply learning algorithms

The best admission scheme will change over time, so the cache management policy will need to adapt to the current traffic. The authors aim for performance within 1% of the optimal hit rate.

While frequency and size are the most common considerations for a cache management policy, the authors' static-policy experiment above indicates these alone will probably be insufficient. Thus, the learning-based approach will need to accommodate multiple parameters in its admission policy.

The policy will also need to be customizable. Different hardware packages can generate different sets of requirements, such as a CDN provider wishing to minimize disk writes to preserve SSD hardware.

Finally, whatever this system does, it will need to impose low overhead on the overall CDN server. We can't drain CDN server resources just to execute our algorithm; that would defeat the purpose.

Darwin's Design

Darwin is designed to tweak the policies of a CDN, not to be a policy itself. There are two reasons for this:

  1. Per-object admission policies would likely carry a high overhead, which would violate one of the requirements stated earlier
  2. Per-object admission policies would be a black box, which would concern CDN operators

Instead, Darwin uses learning to test and select among a set of good policies. Each policy is defined as a tuple of frequency and size thresholds (f, s), and promotes to the HOC all objects that are requested more than f times and have a size less than s. Darwin learns to associate traffic patterns with the best-performing policy.

Offline Training

First, collect historical traces of CDN server operations, generated from the logs of the servers themselves. From these, extract features, which include object sizes, inter-arrival times, and stack distances. Cluster the traces based on these features. Then, associate a policy with a cluster if its hit rate is within X% of that cluster's best-performing policy. This helps bootstrap the online algorithm by pre-computing a set of candidate policies for each traffic pattern.
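
Here's a rough sketch of what that offline clustering step could look like (my own illustration, not the authors' code): k-means over simple per-trace features, with each cluster mapped to a handful of well-performing (frequency, size) policies. The feature values and policy tuples are made up.

    import numpy as np
    from sklearn.cluster import KMeans

    # Hypothetical per-trace feature vectors: [mean object size (bytes),
    # mean inter-arrival time (s), mean stack distance], one row per trace.
    trace_features = np.array([
        [20e3, 0.8, 120], [22e3, 0.9, 130], [25e3, 0.7, 110],     # image-heavy
        [900e3, 5.0, 900], [850e3, 4.5, 950], [1.1e6, 6.0, 870],  # video-heavy
    ])

    kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(trace_features)

    # Offline, each cluster keeps the (frequency, size) policies whose simulated
    # hit rate landed within some tolerance of that cluster's best policy.
    # These tuples are invented for illustration.
    cluster_to_policies = {
        0: [(2, 50_000), (3, 50_000)],
        1: [(1, 2_000_000)],
    }

    # Online, a new traffic window is mapped to its nearest cluster.
    new_window = np.array([[21e3, 0.85, 125]])
    cluster = int(kmeans.predict(new_window)[0])
    print("candidate policies:", cluster_to_policies[cluster])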

The authors discover that many of the policies have correlated performance, so one policy's performance can be predicted from that of another. Using this, build a neural net that takes a set of features as input and outputs the conditional hit probability of one policy given the hits/misses observed under another. Again, this aids the online algorithm with a pre-computation step.
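
The prediction idea could be sketched like this (again my own hedged illustration, not the paper's architecture): train a small network that, given request features and the hit/miss outcome under the currently deployed policy, estimates whether a different policy would have hit.

    import numpy as np
    from sklearn.neural_network import MLPClassifier

    rng = np.random.default_rng(0)

    # Synthetic training data: features of a request plus the hit/miss outcome
    # observed under the currently deployed policy A.
    sizes = rng.random(500)                    # normalized object size
    freqs = rng.random(500)                    # normalized request frequency
    hit_under_a = (freqs > 0.5).astype(float)  # toy behavior of policy A
    X = np.column_stack([sizes, freqs, hit_under_a])

    # Invented ground truth: policy B hits wherever A hit, except on large objects.
    y = ((hit_under_a == 1) & (sizes < 0.7)).astype(int)

    predictor = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000,
                              random_state=0).fit(X, y)

    # Estimate how policy B would have done on a request we only served under A.
    observed = np.array([[0.3, 0.9, 1.0]])  # small, popular object that hit under A
    print("P(policy B hits):", predictor.predict_proba(observed)[0, 1])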

Consider two policies, (f1, s1) and (f2, s2). For any request sequence, an object can be promoted by both, by only one, or by neither. Objects requested more than max(f1, f2) times with size under min(s1, s2) will be promoted by both; objects requested at most min(f1, f2) times, or with size of at least max(s1, s2), will be promoted by neither; everything else may be promoted by one but not the other. Thus our policies are not totally independent of each other.
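
A tiny helper makes the overlap explicit, using the same promotion rule as above (more than f requests, size under s); the thresholds here are made up:

    def promoted(freq, size, policy):
        f, s = policy
        return freq > f and size < s

    def overlap(freq, size, p1, p2):
        count = promoted(freq, size, p1) + promoted(freq, size, p2)
        return ("both", "one", "neither")[2 - count]

    p1, p2 = (2, 100_000), (4, 50_000)
    print(overlap(5, 30_000, p1, p2))   # both: frequent enough and small enough
    print(overlap(3, 30_000, p1, p2))   # one: clears p1's threshold but not p2's
    print(overlap(1, 200_000, p1, p2))  # neither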

The policies in Darwin promote all objects that occur more than a threshold number of times and are smaller than a threshold size, with the goal of maximizing the HOC hit rate. (Remember, cache eviction itself is just a simple LRU algorithm.)
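
Putting the admission rule and the LRU eviction together, here's a minimal sketch of what a Darwin-style HOC might look like (my own illustration with made-up names, not the paper's implementation): a fixed LRU cache guarded by a swappable (frequency, size) admission check.

    from collections import OrderedDict, defaultdict

    class HotObjectCache:
        """LRU cache guarded by a (frequency, size) admission policy."""

        def __init__(self, capacity_bytes, freq_threshold, size_threshold):
            self.capacity = capacity_bytes
            self.f = freq_threshold
            self.s = size_threshold
            self.store = OrderedDict()       # url -> size, in LRU order
            self.used = 0
            self.seen = defaultdict(int)     # request counts per object

        def get(self, url):
            if url in self.store:
                self.store.move_to_end(url)
                return True                  # HOC hit
            return False                     # fall through to the DC / origin

        def maybe_admit(self, url, size):
            self.seen[url] += 1
            # Admission policy: only popular-enough, small-enough objects.
            if self.seen[url] <= self.f or size >= self.s:
                return
            while self.used + size > self.capacity and self.store:
                _, evicted_size = self.store.popitem(last=False)  # evict LRU
                self.used -= evicted_size
            self.store[url] = size
            self.used += size

    hoc = HotObjectCache(capacity_bytes=100_000, freq_threshold=1, size_threshold=50_000)
    for url, size in [("/a.jpg", 20_000)] * 3 + [("/big.mp4", 900_000)]:
        if not hoc.get(url):
            hoc.maybe_admit(url, size)
    print(list(hoc.store))   # ['/a.jpg'] - the large video never gets admitted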

Online selection

A few inbound requests "warm up" the policy selection: we estimate the traffic pattern from these early requests, which points us to a traffic cluster and the set of candidate policies associated with it. We then try each of these policies on incoming requests and associate it with a reward (its hit rate). We repeat this experiment for a series of rounds until we have a high degree of confidence that we've truly selected the best policy, which is then deployed for the remainder of the epoch.

This is where our multi-armed bandit problem comes in. Our first sets of requests are equivalent to the first pulls on the slot machines. In this case, though, we know that certain slot machines are correlated (the policies aligned with that cluster of requests may be very close). We have to keep trying until we're nearly certain we've picked the best one. Policies that are widely different may only need one or two rounds before the probability of success gets over a threshold; policies that are very similar may need many more. Our side information helps here, as we know from our earlier pre-training just how similar those policies are.
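
Here's a rough sketch of that round-based selection loop as I understand it (heavily simplified: the reward function below is a toy stand-in for a real shadow-cache measurement, and real Darwin uses its trained predictors to share information across correlated arms rather than a fixed confidence gap):

    import random

    def hit_rate(policy, requests):
        """Reward for one round: toy stand-in for the HOC hit rate under `policy`."""
        f, s = policy
        hits = sum(1 for freq, size in requests if freq > f and size < s)
        return hits / len(requests)

    def select_policy(candidates, request_rounds, confidence_gap=0.05):
        """Try each candidate on successive rounds, eliminating clear losers,
        until one policy is confidently best (a simplified bandit loop)."""
        alive = list(candidates)
        scores = {p: [] for p in alive}
        for requests in request_rounds:
            for p in alive:
                scores[p].append(hit_rate(p, requests))
            means = {p: sum(v) / len(v) for p, v in scores.items() if p in alive}
            best = max(means, key=means.get)
            # Drop arms that are clearly worse than the current leader.
            alive = [p for p in alive if means[best] - means[p] <= confidence_gap]
            if len(alive) == 1:
                break
        return best

    random.seed(0)
    rounds = [[(random.randint(1, 6), random.randint(10_000, 80_000))
               for _ in range(200)] for _ in range(5)]
    print("selected policy:", select_policy([(1, 70_000), (2, 50_000), (4, 30_000)], rounds))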

Evaluation

The authors evaluate Darwin on a comparable hardware setup to a standard CDN server, with traces based on prior observations. They introduce some synthetic variability into those traces to increase their test set. They then compare Darwin to a series of static and adaptive policies.

  1. Static policies: a set of best-performing static policies on a sample of traces
  2. Percentile: given a sample of inbound requests, map the requests to the policy closest to those requests (similar to our first offline training step)
  3. HillClimbing: deploy one policy and concurrently run two shadow policies that add a small delta to the frequency and size thresholds. If the originally chosen policy is still the best performer after a cycle, keep it and introduce different deltas to the shadow policies; if it's not, switch to the best performer of the three and repeat (sketched below, after this list).
  4. AdaptSize: adapt only the size threshold
  5. DirectMapping: similar to Percentile, but with even more limited updates
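
For concreteness, the HillClimbing baseline (item 3) might look roughly like this sketch, with a toy objective standing in for real shadow-cache hit rates:

    def hill_climb(initial_policy, evaluate, rounds=10, df=1, ds=10_000):
        """Deploy a policy, shadow two perturbed variants, and move toward the best."""
        f, s = initial_policy
        for _ in range(rounds):
            candidates = [(f, s), (f + df, s), (f, s + ds)]   # deployed + two shadows
            scores = [evaluate(p) for p in candidates]
            best = candidates[scores.index(max(scores))]
            if best == (f, s):
                df, ds = -df, -ds          # keep the policy, try different deltas
            else:
                f, s = best                # adopt the better-performing shadow
        return f, s

    # Toy objective with a peak at (3, 40_000); a real system would measure
    # shadow-cache hit rates instead.
    evaluate = lambda p: -abs(p[0] - 3) - abs(p[1] - 40_000) / 10_000
    print(hill_climb((1, 10_000), evaluate))   # -> (3, 40000) with this toy objective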

Comparing directly to each of these, Darwin outperforms all of the competitors.