True NoSQL mindset

By Tomasz Kuczma

December 17, 2020

“You should use NoSQL here”, “SQL doesn’t scale” - those are the top buzz words you can hear at conferences or from candidates while interviewing for a software engineer position. 90% percent of the people cannot explain why is that. They simply repeat those empty buzz words to show that they are up to date with the “modern approach”. Is it really true that SQL does not scale? Let me put some light on the true mindset that stands behind NoSQL and SQL story.

SQL, ACID and JOINs

SQL (Structured Query Language) story goes back to the 70s. It was created by IBM and later embedded into Oracle relational database system. From that time, SQL word is used as a replacement for a technology called relational databases. Relational databases are “simple” to use. Just a set of tables (rows and columns) that are linked by “relations”. SQL allows you to query those tables for data you need and pushes the logic of how to do that effectively into the RDBMS engine.

SQL provides a transactions mechanism that is very convenient to use and makes application development easy as the ACID (atomicity, consistency, isolation, durability) guarantees are provided by the database itself. So even if your application server crashes, data will stay consistent. All you need to do is to create table schemas with a set of constraints and relations (foreign keys) to achieve consistency (and use transactions in the right way of course). Proper relational DB schema should follow “normalization principals” (3NF - Third normal form) to reduce duplication of data. If you need data from 2 tables, you can use JOINs. E.g. if you want to display comments on facebook-like posts with commenter (user) first name and last name you should store commenter_user_id in comments table and JOIN it with user table to retrieve missing data (according to 3NF).

And here, we can start seeing the cons of SQL. So far, everything was perfect - SQL “guard” our data with ACID transactions and simplify application development. Experienced readers should know that JOINs can be slow and sometimes it is good to relax 3NF rules. So why did we introduce them in the first place? The answer is very simple. In the times when SQL was introduced, storage was the most expensive part of the server. All those table data deduplication (normalization) principles were created to save disk space. Nowadays another computer resource is “the expensive one” which I explain in the next chapter.

Also, this model of being consistent all the time does not scale well horizontally, which is required to achieve global scale. 100% consistency is difficult to achieve because coordination of a complex locking system across multiple servers (imagine 1000 servers) is just hard and expensive.

NoSQL to the rescue

NoSQL is sometimes called “no JOINs” databases. Which in my opinion expresses the idea more clearly. Today the most expensive part of the server is the CPU hence NoSQL databases don’t want to waste resources to do JOINs. Of course, NoSQL is a very wide topic as it’s “Not Only SQL” so basically everything that is not SQL is classified as NoSQL including Graph databases and solutions like Google BigQuery which uses SQL-like syntax.

In the case of highly scalable NoSQL databases, most of the time you can imagine them as key-value storage (or at least I will focus on that today to simplify things). The key scalability concept here is that your data is only accessed by the key (e.g. user_id to access data about your users). With that assumption, it is extremely easy to distribute the load (CPU) and storage (disk, memory) across thousands of servers (called shards) e.g. using modulo algorithm hash(key) mod N == shard_id. It is a very simplified approach, but a complex one requires a separate article (stay tuned). That can effectively give you global scalability.

What if I need to query by multiple keys (multiple SQL-like columns e.g. user_name and user_id)? Just duplicate data in the same table (MongoDB collection) or in the second table. Just store it using insert(key=user.name, value=user) and then using insert(key=user.id, value=user). So how to do JOINs in NoSQL? Just duplicate the data directly in the item you store so you don’t need to JOIN anything. Storage is cheap! Updates of the duplicated data might become problematic - let’s skip it in this article as we can workaround it too.

The only thing you have to control and keep in mind while designing your database is the access pattern - how the data will be queried. That is the recipe for successful NoSQL.

What about ACID? That is an important question. Different NoSQL databases have different guarantees and on different levels. You need to be aware of them. E.g. DynamoDB gives you something like atomicity and durability on the item level. Of course, you should forget about full consistency all the time because of the CAP theorem. The key point is that you have to deal with it if you want to have a global scale. Usually “conditional consistent write” is the thing which can do the job - e.g. update if previous_item.version + 1 == curent_item.version where version is an integer field you control to check consistency (e.g. prevent simultenous inconsistent updates).

Of course, there are some situations where taking inconsistency risk is too big e.g. I like to get my consistent “salary wire transfer” :)

Hybrid solution - sharded SQL

FAANG companies (Facebook, Amazon, Apple, Netflix, Google, Microsoft, etc.) grew up on sharded SQL databases! That’s why they hate to hear “you need to use NoSQL” during interviews. You can effectively scale your relational DB by sharding your load based on some key. I like to use customer_id as an example to explain why. If you have 10k customers and they don’t need (or even shouldn’t) to see or share each other data, you can just use multiple SQL databases and dispatch requests to proper SQL shard based on customer_id. Then all SQL instances work independently and you use them as standard SQL DB.

What if that “independence” is not possible in my business case (Twitter users need to see each others names and “twitts”)? It’s more complicated but still should be doable if you know your access patterns. But that is a story for a separate article (coming soon I hope).

Take aways

You can scale your SQL cluster on your own (shard your data), but NoSQL can do it for you by just handling sharding and replication for you. That is the true reason for NoSQL popularity! The key point to success with NoSQL (and high scalability in general) is knowledge about your access patterns! In SQL, you can “always” create a new index on a new column, but in NoSQL, it’s not so easy. NoSQL DBs targets CPU optimization in the opposite of SQL that targets disk space optimization (3NF principals).

Hungry for more?

Check out one of my favorite talks about NoSQL - “Amazon DynamoDB Deep Dive: Advanced Design Patterns for DynamoDB” by Rick Houlihan.

Article by Tomasz Kuczma

Software engineer with a passion. Interested in computer networks and large-scale distributed computing. He loves to optimize and simplify software on various levels of abstraction starting from memory ordering through non-blocking algorithms up to system design and end-user experience. Geek. Linux user.

The views I express are my alone and they do not necessarily express the views of my employer or ex-employers. They are not investment advice nor based on any non-public information of any kind. Poglądy, które wyrażam, są tylko moje i niekoniecznie wyrażają opinie mojego pracodawcy lub byłych pracodawców. Nie są poradami inwestycyjnymi ani nie opierają się na jakichkolwiek niepublicznych informacjach.