
Reputation: 3105

Good distributed general purpose filesystem in my case?

I've been researching the idea of using a distributed file system with my dedicated servers instead of going with Amazon S3, and the results are nothing but massive headaches!

My project has the following characteristics/requirements:

User files are stored on dedicated servers. Each file is stored on 2 separate machines, located in different data centers (150-200 miles away from each other)
I'm using Amazon RDS to host the associated MySQL database (*). It's fairly compact (it only holds IDs/file metadata)
Files/data total around 50TB. Naturally, the data does change and will definitely grow with time

My question is: is there a good general-purpose, distributed, parallel, fault-tolerant file system that has the following characteristics:

Stable & reasonably fast (upload/download)
Fairly easy to set up & maintain
Handles data storage so that I only have to care about removing/adding servers if the need arises (i.e. add new servers to the filesystem's server pool by editing a simple config, or something like that)

I've read about OpenStack, GlusterFS, MogileFS, XtreemFS, etc., but the more I read, the more confused I get!

(*) Yes, I realize the contradiction. Cost-wise it does make sense to host the database on RDS. But storing (up to) 50TB of user files on Amazon is way too expensive compared to using dedicated servers (provided they're good enough).

PS. My app isn't live yet, so I'm open to suggestions if someone has a good idea that fits my case well.

EDIT: I'm not trying to make an S3 clone. I just need to use an existing hosting infrastructure to build a small-scale cloud solution. My question is about finding the right distributed file system to handle/automate this.

Upvotes: 2

Answers (3)

Zunderscore

Reputation: 158

We recently switched from an expensive storage solution to the open source LizardFS for our distributed storage. It is quite simple to set up and scale once you understand the basic concept.

Check out https://docs.lizardfs.com/introduction.html#architecture for a quick overview. But forget about shadow masters and metaloggers for now. What you need to know is that there are

a master: regulates the traffic (make sure it has enough CPU)
chunkservers: actually store the data. Use any kind of off-the-shelf hardware with a bunch of hard disks attached.
clients: just simple mount points, so you can get a giant 50TB mount if you want. The master tells the client where to find/store the files. The actual data is transferred straight from the client to the chunkserver and back.

You can add as many chunkservers as you want; the master will automatically try to balance your storage usage across them. Adding storage is a matter of adding hard drives or adding servers. They don't have to be actual bare metal machines, but that is probably the cheapest option.

There are 2 amazing features in LizardFS that allow georeplication.

Goals (see https://docs.lizardfs.com/adminguide/replication.html#standard-goals): How important are your files to you? You can define, at the file or folder level, how many times a file needs to be replicated. Do you want 2 copies? 3? 10? You could define a goal of 2 copies for old files that are simply there for archiving purposes, and a goal of 4 copies on SSD drives for all new files.

Those same goals can also be used to do georeplication. You define that your data has to be stored in at least two different locations by labeling your chunkservers accordingly (e.g. DC1 and DC2).
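
To make the idea concrete, here is a rough sketch of label-based placement. This is not LizardFS code or its configuration format; the `ChunkServer` class, the `place_copies` function and the server names are all made up for illustration, and treating `"_"` as "any server" is an assumption of this sketch. It only shows the principle: one copy per label in the goal, with the least-used matching server preferred.

```python
from dataclasses import dataclass


@dataclass
class ChunkServer:
    name: str
    label: str        # location label such as "DC1" or "DC2" (hypothetical)
    used_bytes: int   # current disk usage, used for balancing


def place_copies(servers, goal_labels):
    """Pick one server per label in the goal, preferring the least-used one.

    A label of "_" is treated here as "any server" (an assumption of this sketch).
    """
    chosen = []
    for label in goal_labels:
        candidates = [
            s for s in servers
            if s not in chosen and (label == "_" or s.label == label)
        ]
        if not candidates:
            raise ValueError(f"no free chunkserver for label {label!r}")
        chosen.append(min(candidates, key=lambda s: s.used_bytes))
    return chosen


servers = [
    ChunkServer("cs1", "DC1", 700),
    ChunkServer("cs2", "DC1", 300),
    ChunkServer("cs3", "DC2", 500),
    ChunkServer("cs4", "DC2", 900),
]

# A goal meaning "one copy in DC1 and one copy in DC2".
copies = place_copies(servers, ["DC1", "DC2"])
print([s.name for s in copies])
```

With a goal of `["DC1", "DC2"]`, losing an entire data center still leaves one copy of every file in the other one, which is the requirement from the question.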

Rack awareness (see https://docs.lizardfs.com/adminguide/advanced_configuration.html#configuring-rack-awareness-network-topology): you basically define IP ranges to teach the system what your network looks like. This way, clients will try to read files from the closest server.
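
As an illustration of the principle only (this is not how LizardFS is configured, and the IP ranges, the `TOPOLOGY` table and the function names are invented for the example), mapping IP ranges to locations and preferring a replica in the client's own location could look like this:

```python
import ipaddress

# Hypothetical mapping of IP ranges to locations.
TOPOLOGY = {
    ipaddress.ip_network("10.1.0.0/16"): "DC1",
    ipaddress.ip_network("10.2.0.0/16"): "DC2",
}


def location_of(ip):
    """Return the location whose IP range contains this address, or None."""
    addr = ipaddress.ip_address(ip)
    for network, location in TOPOLOGY.items():
        if addr in network:
            return location
    return None


def closest_replica(client_ip, replica_ips):
    """Prefer a replica in the client's location; otherwise use the first one."""
    client_location = location_of(client_ip)
    for ip in replica_ips:
        if client_location is not None and location_of(ip) == client_location:
            return ip
    return replica_ips[0]


# A client in the DC2 range is pointed at the DC2 copy.
print(closest_replica("10.2.7.20", ["10.1.0.5", "10.2.0.5"]))
```

With two data centers 150-200 miles apart, this is what keeps normal reads local instead of crossing the link between sites.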

The ease of setting it up is what sold LizardFS for me. I've heard very good things about Ceph, but setting it up is another matter...

What worried me at first was how proven the technology is/was, so I spent quite a lot of time researching who uses it. Orange Poland (a large telecom provider) is one of the users, and Cloudweavers/OpenNebula actually built a business around it, selling complete solutions.

Upvotes: 2

Onlyjob

Reputation: 5878

I recommend LizardFS and GfarmFS.

IMHO Ceph is a major disappointment and so is XtreemFS.

Upvotes: 1

Tom Andersen

Reputation: 7200

Won't it take more than one person a few months a year to manage these servers? That will cost some money. Then you have the cost of hosting the data yourself, plus the huge added cost that the business/system you are building is not obviously scalable. In addition, any likely investor will be turned away by a complex home-grown data hosting system. How will you ensure integrity/security on par with Amazon? Your max savings per year look like $30,000 or so.

You could save money by doing a de-duplicated storage system where you just store all the unique chunks of data - also see rsync. Don't know how redundant your data is though.
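
A minimal sketch of that idea, assuming fixed-size chunks keyed by their SHA-256 hash. The `DedupStore` class and the tiny chunk size are made up for illustration; rsync itself uses rolling checksums rather than fixed offsets, and a real system would need persistence and much larger chunks.

```python
import hashlib

CHUNK_SIZE = 4  # tiny on purpose so the example is easy to follow


class DedupStore:
    """In-memory store that keeps each unique chunk only once."""

    def __init__(self):
        self.chunks = {}   # digest -> chunk bytes
        self.files = {}    # file name -> ordered list of digests

    def put(self, name, data):
        digests = []
        for i in range(0, len(data), CHUNK_SIZE):
            chunk = data[i:i + CHUNK_SIZE]
            digest = hashlib.sha256(chunk).hexdigest()
            self.chunks.setdefault(digest, chunk)  # skip chunks already stored
            digests.append(digest)
        self.files[name] = digests

    def get(self, name):
        return b"".join(self.chunks[d] for d in self.files[name])


store = DedupStore()
store.put("a.bin", b"AAAABBBBCCCC")
store.put("b.bin", b"AAAABBBBDDDD")

# Six chunks were written, but the two files share their first two chunks.
print(len(store.chunks))
```

How much this saves depends entirely on how much the user files overlap.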

Upvotes: 1
