From the spatial-overlay-as-a-service department

It’s late on a Friday night. You’re three hours into Guild Wars 2 when your European buddy hits you up – they want you to join them on their EU server. You spend the next hour cursing the lag and rubberbanding before you decide to call it a night.

Why does this happen? Why do you get strange behavior from the EU server when you’re in Los Angeles?

The root of the problem is that for decades, we have been building computing systems on the basis of a few major conceptual errors:

  1. The belief that networks can be made reliable
  2. The belief that a single arbiter of state makes a consistency-model “strong”
  3. The belief that objective simultaneity exists

As our demands on computing technology have expanded, so too has the strain induced by these faulty assumptions. We as an industry have spent the past few decades coping with these foundational errors through endless workarounds, yet somehow still falling short of the reliable and predictable systems we seek. These shortcomings affect systems far and wide, in ways that span the gamut from the mundane to the profound. The result is users who are disappointed in one way or another, and an undue expenditure of human effort trying to combat it.


Let’s face facts:

As it turns out, the universe we occupy is a little lacking in cool sci-fi physics. There’s no faster-than-light travel; not for spaceships, and not even for data. Worse yet, it’s not simply a matter of the galactic postal service being slow – existence itself (read: information) has an inconvenient upper limit on how fast it can travel.


Space exists

It doesn’t matter if we’re talking about interstellar distances, nanometers on silicon, or anywhere in between: information can only propagate through space so fast. Information is therefore local, both in terms of its origin and its effects. And don’t hold your breath – physicists are not optimistic about faster-than-light transmission of information through entanglement either.
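
To put a rough number on that Friday-night lag: the sketch below assumes ~9,000 km of fiber between Los Angeles and a Frankfurt-ish datacenter and a refractive index of about 1.47 – assumptions for illustration, not measurements.

```python
# Back-of-the-envelope minimum round-trip time, Los Angeles <-> EU datacenter.
# Distance and fiber slowdown are rough assumptions, not measurements.
C_KM_PER_S = 299_792        # speed of light in vacuum

distance_km = 9_000         # assumed fiber path length, LA -> Frankfurt-ish
fiber_slowdown = 1.47       # light in fiber travels ~1/1.47 of its vacuum speed

one_way_s = distance_km * fiber_slowdown / C_KM_PER_S
round_trip_ms = 2 * one_way_s * 1000
print(f"best-case RTT: ~{round_trip_ms:.0f} ms")   # ~88 ms
```

That is the floor: roughly 90 milliseconds per round trip charged by physics alone, before a single router, queue, or game server gets involved.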

and simultaneity doesn’t

There’s no such thing as simultaneity, at least not in the way most people think about it. Whether we’re talking about wall clocks, atomic clocks, or laser light pulses, simultaneity can only ever be a comparative property from the point of view of a single observer. There is no god’s-eye view, no plane of simultaneity surrounding the Earth. This matters, because there is no one “time”, no universal reference, no sky-hook we can use to create a unified point of comparison for disparate events.


Digging in a bit


Any up-to-date list

The vast majority of databases and other systems in production today (MySQL, TCP, etc.) use linearizable/serializable consistency models, in which a single arbiter manages the linearization – otherwise it would be a free-for-all, and you’d have a graph instead of a line (i.e., a non-branching chain of events). There are various ways to phrase and think about it, but ultimately it comes down to this:

The head of a non-branching chain can only exist at a single point in space.

Yes, that point can move around, as in quorum/failover schemes, but one way or another everybody else has to travel to it in order to be using the same list. This is what consensus algorithms like Paxos and Raft do: they essentially juggle or virtualize the end of the list. They give you a little more fault tolerance, but they’re not magical teleportation devices – they come at the cost of making participants wait for a quorum of the nodes. Furthermore, consensus algorithms work sort-of-okay in a single-datacenter environment where you have a “reliable” network, but one network glitch and you’re in the hurt locker very fast. Consensus across datacenters, continents, P2P networks? Forget about it.
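
To make “everybody travels to the head of the list” concrete, here’s a minimal sketch of a leader-plus-quorum append. The class and method names are ours and purely illustrative – this is not any particular consensus library – but the shape of the wait is the same: nothing commits until a majority has answered.

```python
import time

class QuorumLog:
    """Toy leader-based log: an append is committed only once a majority
    of the cluster (leader + replicas) has acknowledged it. Sketch only."""

    def __init__(self, replicas):
        self.replicas = replicas                  # objects exposing .ack(entry) -> bool
        self.entries = []

    def append(self, entry):
        start = time.monotonic()
        acks = 1                                  # the leader votes for itself
        majority = (len(self.replicas) + 1) // 2 + 1
        for replica in self.replicas:
            if acks >= majority:
                break
            if replica.ack(entry):                # one network round trip per replica
                acks += 1
        if acks < majority:
            raise TimeoutError("no quorum – the head of the list is unreachable")
        self.entries.append(entry)
        return time.monotonic() - start           # the writer waited at least this long
```

Every append pays at least one round trip to a majority of the nodes; if that majority sits across an ocean, so does your write latency.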


Travel considered harmful – CAP Theorem

Gilbert and Lynch define “consistency” as linearizability, and prove (quite rigorously) that interacting with an up-to-date list requires traveling in space. Being subject to alligators, backhoes, network storms, and the like, traveling can at times be quite perilous.

Unfortunately, in the course of their proofs, Gilbert and Lynch managed to throw out the baby with the bathwater.

When we decide that a system will use a single arbiter of truth, we’re saying one of two things: either we want to pretend that faster-than-light travel exists, or the user of the system is willing to wait for the round-trip journey to the arbiter – a journey whose success is not guaranteed. We might be waiting for a while.

Eventual Consistency – A bridge too far

Starting around the mid-2000s, and reeling in horror from the seemingly profound impact of the CAP theorem, database designers similarly proceeded to throw out the consistency baby with the coordination bathwater. They wisely sought out shared-nothing systems, but then naively induced their users to implement their own ad hoc, poorly researched, poorly implemented consistency models as an overlay to make up for the missing feature – that is, consistency which is compatible with human expectations.


We can do better.


End-to-end

The simple truth is that system implementers must reason about their data from end to end, not just inside the walled garden of their “database” consistency model. We assert that those who fail to reason about the plethora of consistency models spanning that gap are planning for failure. Most systems today implement no fewer than five different consistency models. Most implementers tend to think only about the first, and scratch their heads when weird stuff happens (or politely inform you that “you’re holding it wrong”).

Examples of consistency models you might not be thinking about (a sketch of how two of these layers can disagree follows the list):

  • Relational database (linearizable/serializable)
  • TCP connection to the RDBMS (linearizable)
  • RDBMS client / in-process pool of TCP connections (ad hoc)
  • Caching system for your service (ad hoc, possibly wall-clock-based)
  • Connection between the user’s client app and your service, including TCP/HAProxy load balancer (ad hoc, eventual consistency)
  • Caching in your end client application (ad hoc, possibly wall-clock-based)
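
Even two of these layers are enough to produce surprises. Here’s a minimal sketch (the names are ours, purely illustrative): a wall-clock TTL cache sitting in front of a perfectly linearizable store will cheerfully serve a stale value after a write, because the two layers have different notions of “now”.

```python
import time

class TtlCache:
    """Ad hoc wall-clock cache in front of a linearizable store. Sketch only."""

    def __init__(self, store, ttl_seconds=30):
        self.store = store                 # a dict standing in for the RDBMS
        self.ttl = ttl_seconds
        self.cache = {}                    # key -> (value, expires_at)

    def get(self, key):
        value, expires_at = self.cache.get(key, (None, 0.0))
        if time.time() < expires_at:
            return value                   # possibly stale: the cache has no idea
                                           # the store changed underneath it
        value = self.store[key]
        self.cache[key] = (value, time.time() + self.ttl)
        return value

store = {"cart_total": 100}
cache = TtlCache(store)
cache.get("cart_total")                    # warms the cache with 100
store["cart_total"] = 120                  # a perfectly linearizable write
print(cache.get("cart_total"))             # still 100 until the TTL expires
```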

As a developer, getting your head around just one consistency model is often hard enough, but several? Forget about it. Is it any wonder our applications are so flaky these days? Ever used Slack while getting on an elevator? Ever pressed the checkout button on a shopping cart page twice? Ever had to deal with a network outage at your datacenter? How big is your global ops team, and how much of your engineering budget do you spend on cache invalidation? Modern software is overrun with examples of multiple ad hoc, poorly conceptualized consistency models causing problems in everyday life.

(Some wonderful folks are working on applying CRDTs to try to solve these problems. While this is a good start, we do not believe that approach goes quite far enough.)
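
For the curious: a CRDT in this sense is a data structure whose replicas can be updated independently and merged without coordination. Here’s a minimal sketch of the simplest classic example, a grow-only counter – the class name is made up for illustration, not taken from any particular library.

```python
class GCounter:
    """Grow-only counter CRDT: each replica increments only its own slot,
    and merging takes the per-replica max, so merges are commutative,
    associative, and idempotent. Minimal sketch, not production code."""

    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.counts = {}                   # replica_id -> count

    def increment(self, n=1):
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + n

    def merge(self, other):
        for rid, count in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), count)

    def value(self):
        return sum(self.counts.values())

a, b = GCounter("la"), GCounter("eu")
a.increment(3)                             # updates are local; no round trip
b.increment(2)
a.merge(b); b.merge(a)                     # exchange state whenever convenient
assert a.value() == b.value() == 5         # both replicas converge
```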


When in Rome

The physical reality around us doesn’t have centralized arbiters of truth; it’s decentralized. When I set my glass down on the table, it doesn’t have to coordinate with a datacenter in Ashburn, VA to avoid spontaneously jumping to the opposite side of the table. It has local, causal, coordination-free consistency. So too should our systems. This consistency model is totally aligned with our perspective as humans, because it’s the same consistency model we were born into.
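
What does “local, causal, coordination-free” look like in code? Here’s a minimal sketch using vector clocks – the names are illustrative, not any particular product: every event is stamped locally, and causal order is judged after the fact by comparing stamps, with no central arbiter consulted at write time.

```python
def happened_before(a, b):
    """True if vector clock a causally precedes vector clock b."""
    keys = set(a) | set(b)
    return a != b and all(a.get(k, 0) <= b.get(k, 0) for k in keys)

class Node:
    """A participant that orders observations causally, with no arbiter."""

    def __init__(self, name):
        self.name, self.clock = name, {}

    def event(self):                       # purely local: no coordination
        self.clock[self.name] = self.clock.get(self.name, 0) + 1
        return dict(self.clock)            # the event's causal timestamp

    def receive(self, other_clock):        # merge knowledge when a message arrives
        for k, v in other_clock.items():
            self.clock[k] = max(self.clock.get(k, 0), v)
        return self.event()

glass = Node("glass")
table = Node("table")
put_down = glass.event()                   # happens locally, immediately
seen = table.receive(put_down)             # the table learns about it later
assert happened_before(put_down, seen)     # causal order, no datacenter involved
```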

Let’s learn from natural systems and relax. Nature doesn’t do a priori resource planning, and neither should we.