Reputation: 5040
This is an interview question :
Suppose: I have 100 trillion elements, each of them has size from 1 byte to 1 trillion bytes (0.909 TiB). How to store them and access them very efficiently ?
My ideas : They want to test the knowledge about handling large volume of data efficiently. It is not an only-one-correct-answer question.
Save them into some special data structure ?
Actually I have no ideas about this kind of open-end question.
Any help is really appreciated.
Upvotes: 3
Views: 436
Reputation: 3862
The easiest and lowest-cost (at least until you massively scale up) option would be to use an existing service like Amazon S3.
Upvotes: 2
Reputation: 4445
I would use some distributed form of B-tree. B-tree is able to store huge ammounts of data with very good access times (the tree is usually not very deep, but very broad). Thanks to this property, it is used for indexing in relational databases. And it also wont be very difficult to distribute it among many nodes (computers).
I think, that this answer must be sufficient for an interview...
Upvotes: 2
Reputation: 1080
It really depends on the data-set in question. I think the point is for you to discuss the alternatives and describe the various pros/cons.
Perhaps you should answer their question with more questions!
The data structure you choose will depend on what kinds of trade-offs you are willing to make.
For example, if you only ever need to iterate over the set sequentially, perhaps you could should use a linked list as this it has a relatively small storage overhead.
If instead you need random access, you might want to look into:
TL;DR: It's all problem dependent. There are many alternatives.
This is essentially the same problem faced by file systems / databases.
Upvotes: 5
Reputation: 1687
Well, I would use a DHT and split it in chunks of 8MB. Then have a table with the filehash(SHA-1 256), filename, and chunks.
The chunks would be stored the chunks in 3 different NAS. Have 1200 TB NAS servers and load balancer(s) to get any of the 3 copies that are more convenient to get at the time.
Upvotes: 1