True NoSQL mindset
By Tomasz Kuczma
“You should use NoSQL here”, “SQL doesn’t scale” - those are the top buzz words you can hear at conferences or from candidates while interviewing for a software engineer position. 90% percent of the people cannot explain why is that. They simply repeat those empty buzz words to show that they are up to date with the “modern approach”. Is it really true that SQL does not scale? Let me put some light on the true mindset that stands behind NoSQL and SQL story.
SQL, ACID and JOINs
SQL (Structured Query Language) story goes back to the 70s. It was created by IBM and later embedded into Oracle relational database system. From that time, SQL word is used as a replacement for a technology called relational databases. Relational databases are “simple” to use. Just a set of tables (rows and columns) that are linked by “relations”. SQL allows you to query those tables for data you need and pushes the logic of how to do that effectively into the RDBMS engine.
SQL provides a transactions mechanism that is very convenient to use and makes application development easy as the ACID (atomicity, consistency, isolation, durability) guarantees are provided by the database itself.
So even if your application server crashes, data will stay consistent.
All you need to do is to create table schemas with a set of constraints and relations (foreign keys) to achieve consistency (and use transactions in the right way of course).
Proper relational DB schema should follow “normalization principals” (3NF - Third normal form) to reduce duplication of data.
If you need data from 2 tables, you can use JOIN
s. E.g. if you want to display comments on facebook-like posts with commenter (user) first name and last name you should store commenter_user_id
in comments
table and JOIN
it with user
table to retrieve missing data (according to 3NF).
And here, we can start seeing the cons of SQL. So far, everything was perfect - SQL “guard” our data with ACID transactions and simplify application development.
Experienced readers should know that JOIN
s can be slow and sometimes it is good to relax 3NF rules.
So why did we introduce them in the first place?
The answer is very simple. In the times when SQL was introduced, storage was the most expensive part of the server.
All those table data deduplication (normalization) principles were created to save disk space.
Nowadays another computer resource is “the expensive one” which I explain in the next chapter.
Also, this model of being consistent all the time does not scale well horizontally, which is required to achieve global scale. 100% consistency is difficult to achieve because coordination of a complex locking system across multiple servers (imagine 1000 servers) is just hard and expensive.
NoSQL to the rescue
NoSQL is sometimes called “no JOINs” databases. Which in my opinion expresses the idea more clearly.
Today the most expensive part of the server is the CPU hence NoSQL databases don’t want to waste resources to do JOIN
s.
Of course, NoSQL is a very wide topic as it’s “Not Only SQL” so basically everything that is not SQL is classified as NoSQL including Graph databases and solutions like Google BigQuery which uses SQL-like syntax.
In the case of highly scalable NoSQL databases, most of the time you can imagine them as key-value storage (or at least I will focus on that today to simplify things).
The key scalability concept here is that your data is only accessed by the key (e.g. user_id
to access data about your users).
With that assumption, it is extremely easy to distribute the load (CPU) and storage (disk, memory) across thousands of servers (called shards) e.g. using modulo algorithm hash(key) mod N == shard_id
.
It is a very simplified approach, but a complex one requires a separate article (stay tuned).
That can effectively give you global scalability.
What if I need to query by multiple keys (multiple SQL-like columns e.g. user_name
and user_id
)?
Just duplicate data in the same table (MongoDB collection) or in the second table.
Just store it using insert(key=user.name, value=user)
and then using insert(key=user.id, value=user)
.
So how to do JOIN
s in NoSQL? Just duplicate the data directly in the item you store so you don’t need to JOIN
anything.
Storage is cheap!
Updates of the duplicated data might become problematic - let’s skip it in this article as we can workaround it too.
The only thing you have to control and keep in mind while designing your database is the access pattern - how the data will be queried. That is the recipe for successful NoSQL.
What about ACID? That is an important question. Different NoSQL databases have different guarantees and on different levels.
You need to be aware of them. E.g. DynamoDB gives you something like atomicity and durability on the item level.
Of course, you should forget about full consistency all the time because of the CAP theorem.
The key point is that you have to deal with it if you want to have a global scale.
Usually “conditional consistent write” is the thing which can do the job - e.g. update if previous_item.version + 1 == curent_item.version
where version
is an integer field you control to check consistency (e.g. prevent simultenous inconsistent updates).
Of course, there are some situations where taking inconsistency risk is too big e.g. I like to get my consistent “salary wire transfer” :)
Hybrid solution - sharded SQL
FAANG companies (Facebook, Amazon, Apple, Netflix, Google, Microsoft, etc.) grew up on sharded SQL databases! That’s why they hate to hear “you need to use NoSQL” during interviews.
You can effectively scale your relational DB by sharding your load based on some key.
I like to use customer_id
as an example to explain why.
If you have 10k customers and they don’t need (or even shouldn’t) to see or share each other data, you can just use multiple SQL databases and dispatch requests to proper SQL shard based on customer_id
.
Then all SQL instances work independently and you use them as standard SQL DB.
What if that “independence” is not possible in my business case (Twitter users need to see each others names and “twitts”)? It’s more complicated but still should be doable if you know your access patterns. But that is a story for a separate article (coming soon I hope).
Take aways
You can scale your SQL cluster on your own (shard your data), but NoSQL can do it for you by just handling sharding and replication for you. That is the true reason for NoSQL popularity! The key point to success with NoSQL (and high scalability in general) is knowledge about your access patterns! In SQL, you can “always” create a new index on a new column, but in NoSQL, it’s not so easy. NoSQL DBs targets CPU optimization in the opposite of SQL that targets disk space optimization (3NF principals).
Hungry for more?
Check out one of my favorite talks about NoSQL - “Amazon DynamoDB Deep Dive: Advanced Design Patterns for DynamoDB” by Rick Houlihan.
Software engineer with a passion. Interested in computer networks and large-scale distributed computing. He loves to optimize and simplify software on various levels of abstraction starting from memory ordering through non-blocking algorithms up to system design and end-user experience. Geek. Linux user.