generalizing scalability

As time goes by, I like to reflect back on my experiences and generalize the lessons and observations of the past. These distillations can often be immensely useful when approaching new problems. for example - when assessing a new programming language, I ask if it meets my three general requirements for a tool - types, unicode support, and concurrency.

I have similar reflections about service scalability. over time, I have found that many of the solutions to scaling problems fall into these general categories:

statelessness. When state is eliminated or reduced, services can often be horizontally scaled. consider a front-end webserver; if it only generates html and does not act as a database or some other stateful resource, you can essentially deploy as many as you like. This principle also is relevant in programming languages. Stateless functions (or pure functions) can be dispatched to another execution resource.
asynchronicity. By this I mean, the practice of performing only a minimal set of operations in a critical path, and performing most other work in a queue or dispatch system that unblocks the main thread of execution.
commutativity. Somewhat related to asynchronicity; commutative operations can be efficiently combined since they are logically independent.
locality. This applies generally to caching of all kinds, from memcache to CDNs. Exploiting locality allows the solution to be closer to the problem.
sharding. Sharding partitions resources either logically or topologically, which allows resources to be supplemented in a nonuniform manner. Meaning, you can throw more metal at isolated choke points. As a side benefit, sharding can help enhance security and fault-tolerance.
bounds. Is the problem bounded by i/o? memory? cpu? these distinctions speak to the topology of our solution. Most web services are i/o bound, particularly content sites. Email systems like gmail are probably more bound by memory and/or cpu - a machine can only handle n distinct users due to these limits. Other tasks outside of the web realm are often cpu-bound.
refusal. Part of capacity planning is an understanding of how much load a resource should be expected to carry. Beyond the point of limit, systems should have a means by which additional load can be refused. Both client and server should understand how these limits work. HTTP code 429 is a simple example. Rate-limiting, circuit-breaking, and load-balancing are relevant concepts. Another facet of this principle is the refusal to accept unlimited inputs, which can result in resource exhaustion.
failure. Systems designed to gracefully degrade when encountering failure tend to be systems that also scale well when operating normally. It is important not to fetishize this too much though...it is impossible to eliminate single-points-of-failure.

In the end what matters is latency. If you build a system that meets your response time SLA for one user up to one billion users (or whatever other large number you had in mind), you have, by definition, built a scalable system. Scalability implies a durable global promise of acceptable latency for anticipated demand.

Of course there is more to it than that, but these generalizations have served me well.

last update 2012-04-19