GridFS: what it gives us

Question

I'm reading "Seven Databases in Seven Weeks". Could you please explain me the text below:

One downside of a distributed system can be the lack of a single coherent filesystem. Say you operate a website where users can upload images of themselves. If you run several web servers on several different nodes, you must manually replicate the uploaded image to each web server’s disk or create some alternative central system. Mongo handles this scenario by its own distributed filesystem called GridFS.

Why do you need replicate manually uploaded images? Does they mean some of the servers will have linux and some of them Windows?
Do all replicated data storages tend to implement own filesystem?

Markus W Mahlberg · Accepted Answer

On the need for data distribution and its intricacies

Let us dissect the example in a bit more detail. Say you have a web application where people can upload images. You fire up your server, save the images to the local machine in /home/server/app/uploads, the users use the application. So far, so good.

Now, your application becomes the next big thing, you have tens of thousands of concurrent users and your single server simply can not handle that load any more. Luckily, aside from the fact that you store the images in the local file system, you implemented the application in a way that you could easily put up another instance and distribute the load between them. But now here comes the problem: the second instance of your application would not have access to the images stored on the first instance – bad thing.

There are various ways to overcome that. Let us take NFS as an example. Now your second instance can access the images, and even store new ones, but that puts all the images on one machine, which sooner or later will run out of disk space.

Scaling storage capacity can easily become a very expensive part of an application. And this is where GridFS comes to help. It uses the rather easy means of MongoDB to distribute data across many machines, a process which is called sharding. Basically, it works like this: Instead of accessing the local filesystem, you access GridFS (and the contained files within) via the MongoDB database driver.

As for the OS: Usually, I would avoid mixing different OSes within a deployment, if at all possible. Nowadays, there is little to no reason for the average project to do so. I assume you are referring to the "different nodes" part of that text. This only refers to the fact that you have multiple machines involved. But they perfectly can run the same OS.

Sharding vs. replication

Note The following is vastly simplified, because going into details would well exceed the scope of one or more books.

The excerpt you quoted mixes two concepts a bit and is not clear enough on how GridFS works.

Lets first make the two involved concepts a bit more clear. Replication is roughly comparable to a RAID1: The data is stored on two or more machines, and each machine holds all data.

Sharding (also known as "data partitioning") is roughly comparable to a RAID0: Each machine only holds a subset of the data, albeit you can access the whole data set (files in this case) transparently and the distributed storage system takes care of finding the data you requested (and decides where to store the data when you save a file)

Now, MongoDB allows you to have a mixed form, roughly comparable to RAID10: The data is distributed ("partitioned" or "sharded") between two or more shards, but each shard may (and almost always should) consist of a replica set, which is an uneven number of MongoDB instances which all hold the same data. This mixed form is called a "sharded cluster with a replication factor of X", where X denotes the non-hidden members per replica set.

The advantage of a sharded cluster is that there is no single point of failure any more:

Depending on your replication factor, one or more replica set members can fail, and the cluster is still working
There are servers which hold the metadata (which part of the data is stored on which shard, for example). Those are called config servers. As of MongoDB version 3.0.x (iirc), they form a replica set themselves – not much of a problem if a node fails.
You access a sharded cluster via a the mongos sharded cluster query router of which you usually have one per instance of your application (and most often on the same server as your application instance). But: most drivers can be given multiple mongos instances to connect to. So if one of those mongos instances fails, the driver will happily use the next one you configured.

Another advantage is that in case you need to add additional storage or have more IOPS than your current system can handle, you can add another shard: MongoDB will take care of distributing the existing data between the old shards and the new shard automagically. The details on how this is done are covered in the introduction to Sharding in the MongoDB docs.

The third advantage – and the one that has the most impact, imho – is that you can distribute (and replicate) data on relatively cheap commodity hardware, whereas most other technologies offering the benefits of GridFS on a sharded cluster require you to have specialized and expensive hardware.

A disadvantage is of course that this setup only is feasible if you have a lot of data, since many machines are necessary to set up a sharded cluster:

At least 3 config servers
At least a single shard, which should consist of a replica set. The minimal setup would be two data bearing nodes plus an arbiter

But: in order to use GridFS in general, you do not even need a replica set ;). To stay within our above example: Both instances of your application could well access the same MongoDB instance holding a GridFS.

Do all replicated data storages tend to implement own filesystem?

Replicated? Not necessarily. There is DRBD for example, which could be described as "RAID1 over ethernet".

Assuming we have the same mixup of concepts here as we had above: Distributed file systems by their very definition implement a file system.

GridFS: what it gives us

Answers (2)

On the need for data distribution and its intricacies

Sharding vs. replication

Do all replicated data storages tend to implement own filesystem?

Related Questions