Big Data and CAP Theorem

In the world of NoSQL databases, the familiar relational model gives way to concepts such as “eventual consistency” and the “CAP theorem”. Here, we will focus on this theorem.

Over the last 10 years, there has been a revolution in the way information is stored. Relational databases were no longer the best solution for every application: the volume of unstructured data grew, and at the same time the cost of storage fell sharply.

That is how NoSQL databases were born, designed to scale and to be distributed. Recall that relational databases are built on the concept of the transaction, which must fulfil the four ACID properties:

  • Atomicity: A transaction may consist of several steps, but they form an indivisible whole: either all of them are applied or none are.
  • Consistency and Integrity: The data starts in a valid, consistent state, and after the transaction executes it is in another valid state. Integrity rules are never left violated.
  • Isolation: Two or more transactions running at the same time must be independent and must not affect each other.
  • Durability: Once a transaction succeeds, its changes persist in the system.
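
To make atomicity and consistency concrete, here is a minimal sketch using Python's built-in sqlite3 module. The `accounts` table, the `transfer` helper, and the balances are assumptions made up for this illustration. The transfer credits one account and then debits the other; when the debit would violate the rule that a balance cannot be negative, the whole transaction is rolled back, including the credit that had already been applied.

```python
import sqlite3

def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst as one all-or-nothing transaction."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint rejected a negative balance, so the credit
        # applied a moment earlier has been rolled back as well.
        return False

def balances(conn):
    """Return the current balances as a {name: balance} dictionary."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
with conn:
    conn.executemany(
        "INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)]
    )

ok = transfer(conn, "alice", "bob", 30)       # valid: both updates are applied
failed = transfer(conn, "alice", "bob", 500)  # invalid: would overdraw alice
print(ok, failed, balances(conn))
```

After both calls the balances are 70 for alice and 80 for bob: the first transfer was applied in full, and the second left no trace.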

In the world of NoSQL databases, this model gives way to concepts such as “eventual consistency”, which means that consistency stops being absolute and becomes relative to what the application needs (replicas may briefly disagree, but they converge once updates have propagated), and the “CAP theorem”. We will focus on this theorem.
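
The idea of eventual consistency can be sketched with two toy replicas. Everything here is an assumption for illustration: the `Replica` class, the integer version numbers, and the "newest write wins" rule are one simple way to resolve differences, not the mechanism of any particular engine. A write reaches one replica first, a read from the other one is stale for a while, and a later synchronization makes both agree.

```python
from dataclasses import dataclass

@dataclass
class Replica:
    """A toy replica holding one value and the version of its last write."""
    value: str = ""
    version: int = 0

    def write(self, value, version):
        # Keep the write only if it is newer than what the replica holds.
        if version > self.version:
            self.value, self.version = value, version

def sync(a, b):
    """Exchange state so that both replicas end up with the newest write."""
    newest = a if a.version >= b.version else b
    a.write(newest.value, newest.version)
    b.write(newest.value, newest.version)

r1, r2 = Replica(), Replica()
r1.write("v1", version=1)
sync(r1, r2)                      # both replicas now hold "v1"

r1.write("v2", version=2)         # the update reaches only r1 at first
stale_read = r2.value             # a client reading from r2 still sees "v1"

sync(r1, r2)                      # background replication catches up
converged = (r1.value, r2.value)  # both replicas now agree on "v2"
print(stale_read, converged)
```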

Introduction to CAP Theorem

The CAP theorem was proposed by Eric Brewer of the University of California, Berkeley, in 2000, and was formally proved by Seth Gilbert and Nancy Lynch in 2002.

It states that a distributed computer system cannot provide all three of the following properties at the same time:

  • Consistency: An operation always returns the same information, regardless of which node processes the request. No matter which node of the database receives a request, all nodes must answer it in the same way, and which one answered must be transparent to us. All clients see the same version of the data.
  • Availability: The system answers every request it receives, even if one or more nodes are down.
  • Partition tolerance (tolerance to divisions): The system keeps working even when a network failure has split it into groups of nodes that cannot communicate with each other.

In other words, the CAP theorem states that a distributed system can guarantee no more than two of the three properties listed above.
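
The trade-off can be sketched with a two-node toy cluster. The `Cluster` class, the node names, and the mode labels "CP" and "AP" are assumptions for this illustration only; real engines are far more involved. When the network splits, the cluster either refuses the write to keep both nodes in agreement (giving up availability) or accepts it on one side and lets the nodes disagree (giving up consistency).

```python
class Cluster:
    """Two-node toy cluster; `mode` picks what to give up during a partition."""

    def __init__(self, mode):
        self.mode = mode  # "CP" or "AP" (labels assumed for this sketch)
        self.data = {"n1": None, "n2": None}
        self.partitioned = False

    def write(self, node, value):
        if not self.partitioned:
            # Healthy network: replicate to every node before acknowledging.
            for name in self.data:
                self.data[name] = value
            return True
        if self.mode == "CP":
            # The node cannot reach its peer, so it refuses the write and
            # stays consistent at the price of availability.
            return False
        # "AP": accept the write locally and let the replicas diverge.
        self.data[node] = value
        return True

    def read(self, node):
        return self.data[node]

def run(mode):
    cluster = Cluster(mode)
    cluster.write("n1", "a")
    cluster.partitioned = True  # the network splits n1 from n2
    accepted = cluster.write("n1", "b")
    return accepted, cluster.read("n1"), cluster.read("n2")

cp_result = run("CP")  # (False, "a", "a"): write rejected, nodes still agree
ap_result = run("AP")  # (True, "b", "a"): write accepted, nodes disagree
print(cp_result, ap_result)
```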

The following graphic illustrates the theorem:

[Figure: CAP theorem diagram]

Spaces defined by the CAP theorem

The graphic shows four possible spaces, into which any database can be categorized:

  • ND Space: No database engine falls here; it is an empty set. An engine in this space would contradict the CAP theorem, because it would have to guarantee all three properties at once, which is not possible even with the latest technology.
  • CD Space: The engines in this space emphasize consistency and availability (commonly abbreviated CA), and the data is not distributed. This is where relational databases are positioned, although some graph-oriented NoSQL engines can also be found here.
  • CT Space: The engines in this space favor availability and tolerance to network partitions (commonly abbreviated AP). That doesn't mean they offer no consistency at all, but consistency becomes relative and cannot be guaranteed between nodes.
  • DT Space: The engines in this space favor tolerance to network partitions and consistency (commonly abbreviated CP), giving up a certain level of availability. When the network is partitioned, these databases may be unable to respond to certain types of queries.

Although we are not presenting examples of NoSQL engines, it would be wrong to assume that all NoSQL databases of the same type (namely key-value, column-oriented, document-oriented, and graph-oriented) fall into the same space. Many of them can even be configured to belong to a category different from their default one (for example, an engine that sits in the CD space by default can be reconfigured to move into the CT space).
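
One common way this kind of configuration shows up is through read and write quorum sizes. The sketch below is an assumption added for illustration, not something specific to any engine mentioned here: `n` is the number of replicas, `r` the replicas that must answer a read, and `w` the replicas that must acknowledge a write. When r + w > n, every read overlaps with the latest write; lowering `r` and `w` tolerates more failed replicas but allows stale reads.

```python
def quorums_overlap(n, r, w):
    """True when every read quorum shares a replica with every write quorum."""
    return r + w > n

def tolerated_failures(n, r, w):
    """Replicas that may be down while reads and writes still reach a quorum."""
    return n - max(r, w)

# Hypothetical settings as (n, r, w) for a three-replica deployment.
settings = {
    "strict": (3, 2, 2),   # quorums overlap, one replica may be down
    "relaxed": (3, 1, 1),  # no overlap, stale reads possible, two may be down
}

report = {
    name: (quorums_overlap(*s), tolerated_failures(*s))
    for name, s in settings.items()
}
print(report)  # {'strict': (True, 1), 'relaxed': (False, 2)}
```

The same cluster therefore leans toward consistency with the "strict" setting and toward availability with the "relaxed" one, without changing the engine itself.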

Finally, it is important to highlight that the CAP theorem is not absolute but, like everything in the NoSQL world, relative. Consistency and availability pull against each other, but they are not all-or-nothing choices: there is a grey area in which we can give up some consistency, or some availability, to gain a greater degree of the other attribute.

This is the first article in a series that we will publish about Big Data. We hope it has been useful. Future publications will continue with more concepts, tools, and related topics.

