Linearizability And/Vs Serializability in Distributed Databases
Linearizability And Serializability together constitute the gold standard for consistency in distributed databases. Based on this gold standard, one can easily evaluate the consistency promises of various distributed database solutions.
Before we look Linearizability And Serializability together, Let us first understand these two concepts in isolation:
Linearizability: A data storage system depicts linearizability when for a set of operations, being performed on a single data object, two conditions are satisfied. First condition states that the execution is atomic for each of the operation, and second, there is a notion of sequential order in which various operations, in the set, gets executed, the order being aligned with real time.
Atomic execution here means that when an operation is executing, the same is not being interrupted by another operation concurrently. And, real time means that execution times for all operations is based on a single global wall clock.
Linearizability causes illusion that a single copy of a data object exists, although, in reality, there could be multiple copies of data, also called as replicas, on multiple nodes in a typical distributed system. Further, since, linearizability implies atomic execution of each operation, being executed on a single data object, the actual execution of two operations never interleave each other even though the two operations seems concurrent due to certain overlap of time, between invocation and response timestamps for a operation, for the two operations.
Atomic execution and single copy implies that there is a single sequential execution order of all operations, happening on a single object, on a single timeline.
Figure 1 provides an illustration of linearizable writes initiated by process P1 and P2. It can be seen that writes are being executed atomically and follow linear order (op1 -> op2 -> op3 -> op4) being based on a single wall clock where T1 < T2 < T3 < T4 (T1, T2, T3 and T4 are timestamps), as is evident from the changes in the Value of X.
Data consistency in a Linearizability World: In a linearizable world, things are simple and data consistency can be easily reasoned about, as, changes in the state of a data object results from executing one change operation at a time, and the change order follow a consistent sequential execution of operations aligned to a single clock. Therefore, once the state gets changed for a data object due to a operation, it is immediately visible to subsequent operations that are due for execution. Also, it becomes very simple to reason about the state of a data object at any time T, since one can refer to the past execution order of operations until the time T and can come to the conclusion for the state at T.
Implications of Linearizability in a Distributed System: If a distributed storage system implements Linearizability, then for all the write operations corresponding to a data object, it essentially means that all relevant replicas will apply the similar state changes, resulting from the executed write operations, in the exactly same order. The designated order is usually decided by building the consensus between replica nodes. On the contrary, for read operations on a data object, linearizability means that all clients would observe the same state of the object when they all execute read operations at time T on the underlying distributed storage system. Such a read behavior is also termed as Linearizable reads.
Figure 2 provides illustration of Linearizable Reads where Process P1 and P2 observe same value of X when both of them execute reads at the same time T.
Linearizability is composable because, if operations on each object in a system are Linearizable, then all operations in the system are Linearizable.
Serializability: Contrary to Linearizability which concerns with operations on a single object, Serializability defines strongest data consistency for database transactions in case of concurrent transactions being executed on traditional non-distributed databases.
A database transaction generally represents a business logic via a series of read/write operations on multiple data objects, and the serializability consistency definition implies that the execution history for a set of transactions should be equivalent to one of the serial execution of transactions, where only one subsequent transaction is selected for execution after the earlier one finishes.
Figure 3. shows an illustration of serializability where transactions T1, T2 are concurrent and T3 and T4 are concurrent. All the transactions are trying to overwrite the value of X. Due to Serializability, one can see that value of X changes in a way that mimics the serial execution of the transaction in the order T1 -> T2 -> T3 -> T4.
Instead of being looked upon as consistency provider, Serializability is largely understood as ‘I’, the Isolation, in the ACID context of traditional transaction semantics associated with traditional databases, The consistency onus here lies with the business logic, meaning if the business logic is consistent, then serializable isolation will also ensure consistency across the execution of multiple transactions too.
However, contrary to linearizability, serializability does not enforce any real time ordering for the serialization of transactions. Meaning, a transaction starting earlier is not guaranteed to complete execution before any other transaction that starts at a later time as compared to the former. Therefore, although, one can reason about the outcome in case of serializability, the outcome is not deterministic per se as the ordering of serialization is itself non deterministic.
The outcome reasoning is possible with serializability because when the same is enforced, then, there would be no anomalies such as dirty reads, non-repeatable reads, phantom reads, read skew, write skew. These anomalies generally arise out of concurrent execution of two or more transactions and cause inconsistency in the outcome that is difficult to reason, and therefore, absence of these anomalies means that one can refer to the serial equivalent history of the transaction execution until time T to arrive at database state at time T consistently.
Serializability in context of Distributed Systems: Since serializability do not enforce ordering, it is already known that the outcome is non- deterministic in case of overlapping concurrent transactions even for the traditional non-distributed databases where no replicas exists, however, in context of distributed systems, owing to presence of multiple copies of data, serializability can further cause undesirable behavior and ambiguity leading to data inconsistencies across replicas. This may happen, for example, if the set of transactions are applied in different order to various replicas.
Linearizability and Serializability together: After looking thru Linearizability and Serializability separately, it should be obvious that a distributed database system requires both of the them to ensure data consistency at the highest or supreme level. When both are enforced for a distributed storage system, serializability ensures that transactions issued on the storage system are serialized while linearizability ensures a single global serialization order across all distributed replicas.
Meaning, with serializability and linearizability together, each database transaction, wrapping a certain business logic, gets applied to a distributed database in such a way that every transaction feels that it is the only transaction being applied and there exists only a single copy of database. This hides all the complexities of handling concurrent transactions and replicating the resulting state changes across multiple geographically distributed replicas.
Linearizability and Serializability together is also termed as Strict Serializability or External consistency.
With ‘Strict Serializability’ in place, data states in distributed databases remain consistent across all replicas and the same can be easily reasoned about without any ambiguity, and therefore, ‘Strict Serializability’ is sets up a gold standard for data consistency for distributed databases. Various other consistency semantics, such as Sequential, Causal and Eventual are often evaluated against this gold standard to understand the capabilities and deficiencies.
Among all the available distributed databases that are available today, Google’s Spanner is one of the most popular distributed database, being offered by Google cloud, that promises external consistency, in true terms, to its applications.
In case of feedback or queries on this story, do write in the comments section., or you could also connect/message me @ https://www.linkedin.com/in/ajaywlan/