xiaohan2012
xiaohan2012

Reputation: 10342

Why NoSQL is better at scaling out than RDBMSs?

Technical blog http://tekedia.com/12083/nosql-database-advantages-and-disadvantages/ by David Adamo Jr. discusses the advantages and disadvantages of NoSQL:

For years, in order to improve performance on database servers, database administrators have had to buy bigger servers as the database load increases (scaling up) instead of distributing the database across multiple “hosts” as the load increases (scaling out). RDBMS[s] do not typically scale out easily, but the newer NoSQL databases are actually designed to expand easily to take advantage of new nodes and are usually designed with low-cost commodity hardware in mind.

Why are RDBMSs less able to scale out compared to NoSQL DBMSs?

Why buy bigger servers instead of buying cheaper ones?

Upvotes: 94

Views: 59308

Answers (8)

jessedrelick
jessedrelick

Reputation: 1457

The two primary differences between NoSQL and SQL, with only the first being a true advantage:

ACID vs BASE NoSQL typically leaves out some of the ACID features of SQL, cheating its way to higher performance by leaving this layer of abstraction to the programmer.

Horizontal Scaling The real advantage of NoSQL is horizontal scaling, aka sharding. Considering NoSQL documents are sort of self-contained objects, they can be on different servers without worrying about joining rows from multiple servers, as is the case with the relational model.

Let's say we want to return an object like this:

post {
    id: 1
    title: 'My post'
    content: 'The content'
    comments: {
      comment: {
        id: 1
      }
      comment: {
        id: 2
      }
      ...
    
    views: {
      view: {
        user: 1
      }
      view: {
        user: 2
      }
      ...
    }
}

In NoSQL, that object would be stored as is, and therefore could reside on a single server as a sort of self-contained object, without any need to join with data from other tables that could reside on other DB servers.

However, with relational DBMSs, the post would need to join with comments from the comments table, as well as views from the views table. This wouldn't be a problem in SQL until the DB is broken into shards, in which case 'comment 1' could be on one DB server while 'comment 2' is on another DB server. This makes it much more difficult to create the very same object in a RDBMS that has been scaled horizontally than in a NoSQL DB.

Upvotes: 108

XXX
XXX

Reputation: 1

Some RDBMS had the ability 'scale out' way before the NoSQL hype. There is Oracle's Real Applications Cluster and IBM's PureScale.

Upvotes: 0

Justin Russell
Justin Russell

Reputation: 1103

@jessedrelick mentioned sharding in their answer, and I want to dive into that a bit – because that's the piece ultimately made this click for me. (And MongoDB has a page that explains this really well.)

Let's take an extreme example and oversimplify it, because I think that makes it easier to understand. Imagine you're designing a database that holds every post on Instagram.

In a relational database, you'd have a posts table. Works pretty well to start! But as more and more people use Instagram, the number of posts grows... quickly. You end up needing to scale up to a bigger server in order to have space for all those new posts and accommodate all the traffic trying to access them. (Scaling's costly, especially when you consider that your traffic load won't always be the same, and you don't want to pay for resources you're not using 95% of the time.)

And beyond that, these days, many millions (or even billions) of posts are created every day on Instagram. You'll reach a point where you'll need some strategy to distribute both the data and requests (scaling out).

That's where something like ranged sharding comes into play, where your system would chunk data onto different individual nodes and have a lookup table that points you to the right ones based on your needs. And since smaller ("commodity") nodes are cheaper than huge, enterprise-level instances, it can have some cost advantages too.

If you're looking to harness some of the big advantages of relational databases, though – like joining across tables – it can get, let's say, tricky to work with data that spans different servers.

Now, I don't really think there's anything stopping you from implementing a ranged sharding lookup system like that with an RDBMS. But at that point, you've lost some key advantages that come with the relational structure (again, like joining). And it might be worth re-analyzing whether that changes whether or not you'd choose RDBMS or NoSQL.

Upvotes: 0

Jens Schauder
Jens Schauder

Reputation: 81998

Typical RDBMs make strong guaranties about consistency. This requires to some extent communication between nodes for every transaction. This limits the ability to scale out, because more nodes means more communications.

NoSQL systems make different tradeoffs. For example they don't guarantee that a second session will immediately see the data committed by the first session. Thereby decoupling the transaction of storing some data from the process of making that data available for every user. Google "eventually consistent". So a single transaction doesn't need to wait for any (or for much less) inter node communication. Therefore they are able to utilize a large amount of nodes much more easily.

Upvotes: 18

manvendra yadav
manvendra yadav

Reputation: 179

Why NoSQL databases can be easily horizontally scaled than SQL ones? I have been trying to figure out why people keep saying this. I came across many articles which only confused me with their not-industry familiar terminologies and vague assumptions. I will suggest you read Designing Data-intensive applications by Martin Kleppman. Also, I will share some of my understanding of this subject.

JOINS - in the case of many-to-one or many-to-many relationships there is no way that any database invented till now can keep the data together in one table or document so if the data is sharded(or partitioned), either it is SQL or NoSQL, the latency will be same, the database has to look for both the documents. NoSQL seems to dominate only in the case of one to many relationships. For example:

NoSql

Student

{
  "name": "manvendra",
  "education": [
    {
      "id": 1,
      "Degree": "High School"
    },
    {
      "id": 2,
      "Degree": "B.Tech"
    }
  ]
}

Eduction Institute collection

[
  {
    "id": "1",
    "name": "army public school"
  },
  {
    "id": "2",
    "name": "ABES Engineering College"
  }
]

Sql

Student Table

id | name        
1  | Manvendra

Education Institute

id | Name
1  | Army public school
2  | ABES Engineering college

Studies Table

student  | education institute | degree
1        | 1                   | high school
1        | 2                   | B.tech

Now suppose in the case of NoSql if both collection's data is on different nodes there will some extra time required to resolve the ids of the education institute and this situation is similar in the case of SQL databases so where is the benefit? I can't think of any.

Also, you must be thinking why can't we store the education institute info also in the same student collection, then it will be like:

{
  "name": "manvendra",
  "education": [
    {
      "name": "Army public school",
      "Degree": "High School"
    },
    {
      "name": "ABES Engineering College",
      "Degree": "B.Tech"
    }
  ]
}

which is really a bad design because there is a many-to-many relationship between student and education institute, many students might have studied from the same institute so tomorrow if there is a change in name or any information of the institute it will be really a very difficult challenge to change at all places.

However, in the case of a one-to-many relationship, we can club all the info together for example: Consider a customer and an order relationship

{
  "name": "manvendra",
  "order": [
    {
      "item": "kindle",
      "price": "7999"
    },
    {
      "item":"iphone 12",
      "price":"too much"
    }
  ]
}

Since an order only belongs to one customer it makes sense to store order info in one place however storing item id or name is another choice anyway, if we use SQL database here, there will be two tables with orders and customers which will not give good results to queries if data is not stored in the same node.

So saying joins in an argument as to why the NoSql database is easier to scale horizontally does not make sense.

TRANSACTIONS

Both SQL(Postgres, MySQL, etc) and NoSQL(MongoDB, Amazon's DynamoDB, etc) support transactions so there is nothing left to discuss on that.

ACID

ACID is overused just like CAP actually it is all about showing a single copy of data to the client instead actually there might be multiple copies of data(to enhance availability, fault-tolerance, etc) and what strategies the database uses to do that. For example in Postgres in the case of a master-slave distributed system, one can opt for synchronous or asynchronous replication and the replication is made possible with WAL(Write ahead logs) and same is the case in MongoDB, only in place of WAL it has oplog(Operations Log), both support streaming replication and failovers. Then where is the difference? Actually, I can't find a very strong reason that why NoSql databases can be scaled easily. What I can say is NoSql is the latest so databases come with ready-made support for horizontal scaling for example consider Mongos in MongoDB, they do all the dirty work of sharding documents, routing requests to the specific shard, etc. So tomorrow if Postgres or MySQL come up with some mechanism of intelligently sharding tables so all the related data is mostly kept in one node then it may put an end to this debate because there is nothing intrinsic in a relational database that prevents it from horizontal scaling.

On an optimistic note, I believe in the near future it will all be about the strategies. How you are planning to scale and those strategies will be independent of how you are storing data either in tables or documents. For example in Amazon's DocumentDB, there is a concept of auto-scaling in and out but if you want to achieve this with sharding it will be a burden to copy data each time you are scaling in and out. In DocumentDB this is taken care of as a shared cluster volume(data storage is separated from computing) which is nothing but a shared disk to all the instances(primary or secondary) and to escape from the risk of the shared disk failure DocumentDB replicates data of the shared disk to six other disks in different availability zones. So point to be noted here is DocumentDB mixed the concept of the shared disk and standard replication strategy to achieve its goal. So it is all about the strategy you are using in your database which is what matters

Upvotes: 15

Anuj Mehta
Anuj Mehta

Reputation: 1136

In RDBMS when the data becomes huge then it may happen that tables are spread across multiple systems and in that case performing operations like JOIN are very slow.

In case of NoSQL in general related data are stored together on same machine (either in single document - in document oriented databases or in case of Wide column datastore the related columns are on same machine). Hence its easy to scale out on a number of low end machines, obviously in this case there will be duplicate data in multiple places which is not the case in RDBMS

Upvotes: 3

Suraj Singh
Suraj Singh

Reputation: 1

For a NO SQL, 1.All the child related to a collection is at the same place and so on same server and there is no join operation to lookup data from another server .

2.There is no schema so no Locks needed on any server and the transaction handling is left to the clients .

The above 2 saves a lot of overhead of scaling in NO-SQL.

Upvotes: -1

Martin Samson
Martin Samson

Reputation: 4090

RDBMS have ACID ( http://en.wikipedia.org/wiki/ACID ) and supports transactions. Scaling "out" with RDBMS is harder to implement due to these concepts.

NoSQL solutions usually offer record-level atomicity, but cannot guarantee a series of operations will succeed (transaction).

It comes down to: to keep data integrity and support transactions, a multi-server RDBMS would need to have a fast backend communication channel to synchronize all possible transactions and writes, while preventing/handling deadlock.

This is why you usually only see 1 master (writer) and multiple slaves (readers).

Upvotes: 73

Related Questions