
Notes on Spanner, Google's massively distributed database

Spanner

Spanner is consistently cited as one of the papers to read if you're pursuing a senior engineering role, and with good reason. It takes an intense database to support Google's advertising business as well as Gmail and Google Photos, along with all of the use cases its Google Cloud Platform clients have developed over the years. It's technology that works extremely well in a hard set of use cases.

Back when I was focused on high-performance computing, I was in the Hadoop world, which meant I was using a variant of Bigtable (another of Google's big data tools). I was familiar with Spanner, but had never gone through the original paper behind it. This isn't new - it'll be familiar to most folks who have studied it for system design interviews - but it is new to me.

Introduction

Spanner needs a little pre-work to make sense. At a high level, it's a database built on Paxos state machines located in data centers around the world, so we need a little background on Paxos first. Paxos is an algorithm for establishing consensus across unreliable processes (such as when the network might be down in Greece, but available in Estonia). Paxos works by having one node propose a message to multiple other nodes, which then "vote" to accept or reject it. In our database example, a node might receive a SQL command UPDATE table SET name = 'Joe' WHERE id = 4567, and the other nodes would need to vote on whether to accept that change. A node might reject the proposal if it has already promised to honor a newer, higher-numbered proposal. There's a formal process for running these votes and pushing messages around, which I'll skip for now: suffice it to say that Spanner uses this consensus algorithm to agree on changes.
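
To make that voting step a little more concrete, here's a minimal sketch of a single-decree Paxos acceptor in Python. This is my own illustration, not Spanner's code, and it skips the proposer side and all the messaging:

# A Paxos acceptor promises to ignore proposals numbered lower than the
# highest proposal number it has already seen.
class Acceptor:
    def __init__(self):
        self.promised_n = -1        # highest proposal number promised so far
        self.accepted_n = -1        # proposal number of the value we accepted
        self.accepted_value = None  # e.g. "UPDATE table SET name = 'Joe' WHERE id = 4567"

    def prepare(self, n):
        # Phase 1 vote: may proposal number n proceed?
        if n > self.promised_n:
            self.promised_n = n
            # Report any previously accepted value so the proposer must honor it.
            return ("promise", self.accepted_n, self.accepted_value)
        return ("reject", None, None)

    def accept(self, n, value):
        # Phase 2 vote: accept the value unless we promised a higher number.
        if n >= self.promised_n:
            self.promised_n = n
            self.accepted_n = n
            self.accepted_value = value
            return "accepted"
        return "rejected"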

Spanner needs to do a lot of replication, because it has no idea who might need data, when they need it, and from where they might need it. So Spanner needs to handle automatic failover between replicas, and it needs to automatically push data around machines as needed. And it's got to do this at scale - trillions of database rows and millions of machines.

Though Spanner is focused on gigantic replicated datasets, it also contains some features designed to address pain points with other databases at Google. Bigtable, for example, is difficult to use with complex, evolving schemas. As a result, Spanner evolved from a versioned key-value store (like Bigtable) into a temporal multi-version database. Data lives in schematized tables; each version of a value is timestamped with its commit time, and old versions are garbage-collected under configurable policies. It also provides a SQL-based query language.

Spanner exposes several of its distributed features to applications. An application can specify which datacenters can contain which data, along with how far data might be from its users. Applications can also control data movement across databases. Spanner also provides externally consistent reads and writes, and globally-consistent reads across the database when given a timestamp. To facilitate this, Spanner also contains a TrueTime API, which exposes uncertainty in the clock. In periods of high uncertainty, Spanner will slow down to wait it out.

Implementation

A deployment of Spanner is known as a universe, which is then sub-partitioned into zones. Universes correspond to environments like dev/stage/prod; a zone consists of the physical and logical hardware on which data resides. There may be multiple zones per datacenter, but a zone is always contained within a single datacenter. Each zone contains a single zonemaster server, with up to several thousand spanservers reporting to it. The zonemaster assigns data to spanservers, while the spanservers serve data to clients. Each zone also runs location proxies, which clients use to identify which spanserver holds the data they wish to access.
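
As a rough mental model of that hierarchy (my own sketch with invented field names, not anything from the paper):

from dataclasses import dataclass, field

@dataclass
class Spanserver:
    tablets: dict = field(default_factory=dict)          # data assigned by the zonemaster

@dataclass
class Zone:                                              # lives entirely in one datacenter
    zonemaster: str                                      # assigns data to spanservers
    location_proxy: dict = field(default_factory=dict)   # used by clients to find data
    spanservers: list = field(default_factory=list)      # hundreds to thousands per zone

@dataclass
class Universe:                                          # e.g. dev, stage, prod
    zones: list = field(default_factory=list)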

The spanserver stack

Each spanserver is responsible for 100-1000 instances of a data structure called a tablet, which implements the following key/value mapping:

(key:string, timestamp:int64) -> string

The tablet's state is stored in B-tree-like files on Google's Colossus file system. On top of each tablet, the spanserver runs a Paxos state machine. The Paxos machine manages writes to the data, which involve writing both to the tablet's log and to the Paxos log. These writes are appended in sequential order.
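
To picture the mapping above, here's a toy version in Python (my own sketch; real tablets are B-tree-like files plus a write-ahead log, not an in-memory dict). Each key holds multiple timestamped versions, and a read at a timestamp returns the newest version at or before it:

import bisect

class Tablet:
    def __init__(self):
        self.timestamps = {}  # key -> sorted list of int64 timestamps
        self.values = {}      # key -> list of values, parallel to timestamps

    def write(self, key, timestamp, value):
        ts = self.timestamps.setdefault(key, [])
        i = bisect.bisect_right(ts, timestamp)
        ts.insert(i, timestamp)
        self.values.setdefault(key, []).insert(i, value)

    def read(self, key, timestamp):
        # Newest value written at or before `timestamp`, or None if none exists.
        ts = self.timestamps.get(key, [])
        i = bisect.bisect_right(ts, timestamp)
        return self.values[key][i - 1] if i > 0 else None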

These Paxos state machines are used to replicate the key/value mappings across a set of replicas. Writes must initiate the Paxos protocol at the leader, while reads can access the data directly from the tablet at any sufficiently up-to-date replica. Each spanserver also implements a lock table for concurrency control. At every leader replica, the spanserver also implements a transaction manager, which is used to coordinate transactions that span multiple Paxos groups.

Directories and Data Model

Spanner contains a bucketing abstraction called a directory, implemented as a set of contiguous keys that share a common prefix. The authors note that bucket would be a more appropriate term. Directories allow applications to control the locality of their data through key structure.
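
A quick illustration of the idea, using a hypothetical key scheme of my own (nothing Spanner-specific): if an application prefixes every key with a user id, all of that user's rows land in one directory and can be placed or moved as a unit.

def directory_of(key: str) -> str:
    # Directory = the shared key prefix; here, everything before the first '/'.
    return key.split("/", 1)[0]

rows = ["user42/albums/1", "user42/photos/9", "user7/albums/3"]
for key in sorted(rows):
    print(directory_of(key), "->", key)
# user42's rows sort contiguously, so they form one directory and travel together.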

When data is moved between Paxos groups, it is done not at the individual key level, but at the directory level. Spanner does this primarily as an optimization: to keep a single Paxos group from becoming overloaded, or to move a directory into a group closer to its accessors.

Movedir is the background task that manages this movement. It is not implemented as a single big transaction, so the bulk of the data can be moved in chunks without blocking ongoing reads and writes. It registers the fact that it is starting to move data, moves the data in the background, then uses a final transaction to move the remainder and update the metadata for the two Paxos groups.

Spanner's data model exposes semi-relational tables, a SQL-like query language, and general transactions to users, all of which were implemented due to usability concerns in other Google databases. The application data model is layered on top of the key-value mappings, so each "row" must have a name (for its key), but the values will have an enforced schema that allows SQL-like operations. With this feature in place, applications can control data locality - even within the same table - using specific key choices.

TrueTime

While a standard datetime library gives a particular timestamp, TrueTime gives a confidence interval. The time is definitely within the interval [earliest, latest] but the internal clock cannot get more confident than that. TrueTime uses various atomic clocks (which can drift significantly due to frequency errors) and GPS reference clocks (which can be affected by radio interference, antenna failures, etc.). TrueTime exposes this drift, which is used for transaction timestamp handling.
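
The API surface is small: TT.now() returns an interval, and TT.before()/TT.after() answer whether a timestamp has definitely not arrived or has definitely passed. Here's a toy stand-in in Python; the epsilon value is invented for illustration, not measured:

import time
from dataclasses import dataclass

@dataclass
class TTInterval:
    earliest: float  # absolute time is guaranteed to be >= earliest...
    latest: float    # ...and <= latest

class TrueTime:
    def __init__(self, epsilon_seconds=0.004):  # made-up uncertainty bound
        self.epsilon = epsilon_seconds

    def now(self) -> TTInterval:
        t = time.time()
        return TTInterval(t - self.epsilon, t + self.epsilon)

    def after(self, t: float) -> bool:
        return self.now().earliest > t   # t has definitely passed

    def before(self, t: float) -> bool:
        return self.now().latest < t     # t has definitely not arrived yet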

TrueTime is implemented by a set of time master machines per datacenter. Most masters have GPS receivers with dedicated antennas that are separated physically to reduce uncertainty, with a few masters containing atomic clocks. These reference clocks are all regularly checked against each other, and against the internal local clock. If a clock begins to show dramatic drift against the others, it evicts itself. Clock daemons on individual servers poll from a variety of masters, including those at other datacenters. Between synchronizations with various masters, an individual clock daemon will show a slowly increasing time uncertainty.
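
The paper quotes a 30-second poll interval and an assumed worst-case local drift of 200 microseconds per second, which makes the uncertainty sawtooth between roughly 1ms and 7ms. A back-of-the-envelope sketch of that growth:

BASE_EPSILON_MS = 1.0      # roughly, communication delay to the time masters
DRIFT_RATE_MS_PER_S = 0.2  # assumed worst-case local clock drift (200 us/s)
POLL_INTERVAL_S = 30       # daemons poll the time masters every 30 seconds

def epsilon_ms(seconds_since_sync: float) -> float:
    # Uncertainty grows linearly between synchronizations, then resets.
    return BASE_EPSILON_MS + DRIFT_RATE_MS_PER_S * (seconds_since_sync % POLL_INTERVAL_S)

# epsilon_ms(0) -> 1.0 ms right after a sync; epsilon_ms(29.9) -> just under 7 ms.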

Concurrency Control

TrueTime is implemented to facilitate concurrency control, alongside the features of the Paxos state machines. Spanner supports three types of transactions: read-write transactions, read-only transactions, and snapshot reads. A standalone write is implemented as a read-write transaction.

A read-only transaction executes at a system-chosen timestamp with no locking, and can occur on any replica. It must be predeclared as having no writes. A snapshot read, by contrast, requires that the client specify a timestamp. In the event that a server fails mid-transaction, clients can continue the query on a different server by repeating the timestamp and the read position.

Transactional reads and writes use two-phase locking, so they can be assigned timestamps when all locks have been acquired, but before any locks have been released. Spanner assigns the transaction the timestamp of the Paxos write that represents the commit.

Spanner enforces an external-consistency invariant: if the start of a transaction T2 occurs after the commit of a transaction T1, then the commit timestamp of T2 must be greater than the commit timestamp of T1. This means that clients cannot see any data committed by Ti until TT.after(si) is true. This is also enforced within Paxos, where every replica tracks a value called safe time (tsafe) that is used as an upper bound for reads. If a read comes in requesting data at a timestamp greater than tsafe, the replica will not fulfill the request.
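
A minimal sketch of that gate, assuming a replica that simply tracks its safe time as an integer timestamp (my own simplification):

class Replica:
    def __init__(self):
        self.t_safe = 0  # maximum timestamp at which this replica is known up to date

    def can_serve_read(self, t_read: int) -> bool:
        # A read at t_read may be served only if t_read <= t_safe.
        return t_read <= self.t_safe

    def advance_safe_time(self, t: int):
        # t_safe advances as Paxos applies writes; prepared-but-uncommitted
        # transactions hold it back until they resolve.
        self.t_safe = max(self.t_safe, t)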

Just like Bigtable, writes in a read-write transaction are buffered at the client until a commit - so reads in a transaction won't see the transaction's writes. The client issues reads to the leader replica of the group, which acquires necessary read locks and then reads data. When the client has completed all its reads and buffered all its writes, it then chooses a coordinator group and sends a commit message to each participant's leader. The client thus drives two-phase commits and avoids sending data twice.

The participant leaders acquire write locks, and each chooses a prepare timestamp that must be greater than any timestamps it has assigned to previous transactions. The coordinator leader chooses a timestamp for the entire transaction after hearing from all other participant leaders; it must be at least as great as all of the prepare timestamps. The coordinator then logs a transaction commit record through Paxos.

Before allowing any coordinator replica to apply the commit record, the leader waits until TT.after(s) is true - that is, until the commit timestamp s is guaranteed to be in the past - so no client can observe the transaction's effects before its timestamp. After the wait, the commits are applied at that timestamp, and locks are released.
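
Putting the last two steps together, here's a hedged sketch of a coordinator choosing the commit timestamp and then sitting in commit wait. The tt_* helpers are a toy TrueTime with an invented uncertainty, as in the earlier sketch:

import time

EPSILON = 0.004  # invented clock uncertainty in seconds

def tt_now_latest() -> float:
    return time.time() + EPSILON

def tt_after(t: float) -> bool:
    return time.time() - EPSILON > t  # t has definitely passed

def coordinator_commit(prepare_timestamps):
    # The commit timestamp must cover every participant's prepare timestamp and
    # TT.now().latest at the time the commit request arrived (the paper requires
    # strictly greater than the latter; max() is close enough for a sketch).
    s = max(list(prepare_timestamps) + [tt_now_latest()])
    # Commit wait: don't apply or expose the commit until s is definitely past.
    while not tt_after(s):
        time.sleep(0.0001)
    return s  # now safe to apply the commit record at s and release locks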

Evaluation

Database evaluation criteria were pretty standard here, so I'm skipping any notes. However, I did think the benchmarking strategy was interesting. Each spanserver was a fairly standard 4-core, 4GB RAM machine with clients run on separate machines. Each zone contained one spanserver, and clients and zones were placed in datacenters less than 1ms apart on the network. Their rationale was that most applications will not distribute their data worldwide. The test database contained 50 Paxos groups with 2500 directories, and all operations were reads/writes of ~4KB.

I thought the choice of data distribution was curious, as the first known application (F1, the advertising backend) is most definitely a globally distributed database. The production numbers F1 reported are still pretty good: 8.7ms mean reads with a fairly large standard deviation, and multi-site commits needing only 103.0ms on average.