Reputation: 1393
I would like to know the best solution for storing a large number of images across multiple servers, the way Google and Facebook do.
It seems that storing them in the filesystem is better than storing them inside a database, but what about using a NoSQL database like Cassandra?
Do Google and Facebook store the same image on multiple servers for load balancing? How does it work? What is the best solution?
Thanks a lot.
Upvotes: 6
Views: 4074
Reputation: 1426
Note: I know this is an old question; I just want to counterbalance some misconceptions about cost, as I'm doing this right now as a test.
Unlike what DavidB thinks, it does not cost millions. Even if you were to run dedicated hosted hardware, you'd easily be under a couple of thousand dollars a month (BTDT; one of my clients is running an 8-node cluster for about $800/month). That said, that's a maintenance headache you want to avoid, and Cassandra on EC2 is far easier to deal with.
You could easily run a substantial production cloud on EC2 for less than $1000/month, and you can do R&D clouds for less than $100/month (I spent about $52 last month on a 10-machine test cluster). I highly recommend using TurnKey Linux to manage and provision your R&D farm, as their tools will allow you to migrate instances from your desktop to pretty much any virtualized hosting platform in a few minutes (and vice versa). Plus, they have really slick integration with EC2.
For really serious levels of traffic, Pinterest once stated that they spend $15 to $50/hour depending on server load, auto-scaling to meet traffic demands; see http://www.theregister.co.uk/2012/04/30/inside_pinterest_virtual_data_center/ for details.
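To compare that hourly figure with the monthly numbers above, here is a rough conversion. It assumes a 30-day month and a constant hourly rate, which auto-scaling makes unrealistic, so treat the result as a bracket rather than an actual bill. The function name is just illustrative.

```python
# Assumption: a 30-day month billed at a constant hourly rate.
HOURS_PER_MONTH = 24 * 30


def monthly_cost(hourly_rate: float) -> float:
    """Convert an hourly spend into an approximate monthly spend."""
    return hourly_rate * HOURS_PER_MONTH


low = monthly_cost(15)   # quiet periods
high = monthly_cost(50)  # peak load
print(f"${low:,.0f} to ${high:,.0f} per month")
```

So even at that scale the bill is in the tens of thousands of dollars per month, not millions.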
The real cost is in setting up and managing your distributed Cassandra instance. Luckily, Netflix has just released a ton of management tools for exactly this. You can find them here: https://github.com/netflix - there are also a ton of interesting videos about Netflix's use of AWS, particularly moving stuff from Cassandra to S3 - see their blog here: http://techblog.netflix.com/2012/12/videos-of-netflix-talks-at-aws-reinvent.html
Upvotes: 1
Reputation: 7001
There's nothing wrong with the approach you're taking. As mentioned, there are caveats; however, it is possible, and a lot of people and companies are successfully storing files in Apache Cassandra.
The principle behind this is to take a file, break it into a set of chunks, and store those chunks as columns in a row. When retrieving, pull each column, reassemble the file, and voila.
Cassandra FAQ: large file and blob storage
...files of around 64Mb and smaller can be easily stored in the database without splitting them into smaller chunks...
...its files are broken down into blocks (whose sizes are capped), where each block (see FileBlock) is stored as the value of a column in the corresponding row...
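The chunking idea described above can be sketched roughly as follows. This is only an illustration, not how any particular Cassandra client does it: a plain Python dict stands in for the row (column name -> value), no real driver calls are used, and the chunk size, column naming scheme, and row key are all made-up assumptions.

```python
from typing import Dict

# Assumption: 64 KiB chunks; pick whatever cap suits your cluster.
CHUNK_SIZE = 64 * 1024


def split_into_chunks(data: bytes, chunk_size: int = CHUNK_SIZE) -> Dict[str, bytes]:
    """Break a file into capped-size chunks, one per (hypothetical) column."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    columns: Dict[str, bytes] = {}
    for index, offset in enumerate(range(0, len(data), chunk_size)):
        # Zero-padded names keep the columns in order when sorted as strings.
        columns[f"chunk_{index:08d}"] = data[offset:offset + chunk_size]
    return columns


def reassemble(columns: Dict[str, bytes]) -> bytes:
    """Pull each column in name order and join the chunks back into the file."""
    return b"".join(columns[name] for name in sorted(columns))


# A dict of dicts stands in for the store: row key -> {column name -> chunk}.
store: Dict[str, Dict[str, bytes]] = {}
image = bytes(range(256)) * 1000  # 256,000 bytes of stand-in image data
store["images/cat.jpg"] = split_into_chunks(image)
restored = reassemble(store["images/cat.jpg"])
```

In a real deployment each chunk would be written to and read from the cluster instead of a dict, but the split and reassemble steps stay the same.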
You'll get more positive feedback on the Cassandra mailing list and on the IRC channel.
Finally, this is from 2009, and written by folks at Facebook, which should go some way to help answer more of the fundamental questions you have: Cassandra - A Decentralized Structured Storage System.
Upvotes: 4
Reputation: 2234
If you want to store in a "cloud" environment, you're best off going with a cloud solution that has the resources, such as Google App Engine or Amazon Web Services. You're not going to be able to set up your own, if that is the question. It will cost millions of dollars and a lot of resources to manage. And yes, Google and Facebook use thousands of servers to distribute their data in "clouds".
Upvotes: -1