Reputation: 3843
From online discussion groups and blogs, I have seen that a lot of interview questions are about handling large-scale datasets. Is there a systematic approach to analyzing this type of question? More specifically, are there any data structures or algorithms that can be used to deal with this? Any suggestions are much appreciated.
Upvotes: 5
Views: 3350
Reputation: 7855
"Large-scale" data sets fall into several categories that I've seen, each of which presents different challenges for you to think about.
Other problems often associated with large-scale data sets, but not size-related problems per se, are:
Upvotes: 8
Reputation: 7939
When people describe a large data set, they frequently mean one where the entire data set cannot be held in memory. This creates challenges as to what data to load, and when to load and unload it.
One approach is to use a sequential data file and process from beginning to end. That is effective when the nature of the processing is sequential, but doesn't work well when the processing needs to combine data from various parts of the data set.
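As a rough illustration of the sequential approach, here is a minimal sketch that aggregates a file in a single pass while holding only one line in memory at a time. The file layout (one integer per line) and the function name `streaming_sum` are assumptions made for the example.

```python
def streaming_sum(path):
    """Return (total, count) for a file holding one integer per line.

    The file object is iterated lazily, so memory use stays roughly
    constant no matter how large the file is.
    """
    total = 0
    count = 0
    with open(path, "r", encoding="utf-8") as f:
        for line in f:  # yields one line at a time
            line = line.strip()
            if line:  # skip blank lines
                total += int(line)
                count += 1
    return total, count
```

This works because a sum only needs a running total. It would not help with a computation that has to compare records from distant parts of the file.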
Another approach is some sort of indexed file, retrieving necessary bits of data as they are needed.
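One simple way to picture an indexed file is a small in-memory map from key to byte offset, built in one sequential pass and then used to seek straight to a record. The sketch below assumes a hypothetical layout of one `key,value` record per line; `build_offset_index` and `lookup` are illustrative names, not part of any library.

```python
def build_offset_index(path):
    """Scan the file once and map each key to its line's byte offset."""
    index = {}
    offset = 0
    with open(path, "rb") as f:  # binary mode so offsets are exact bytes
        for line in f:
            key = line.split(b",", 1)[0].decode("utf-8")
            index[key] = offset
            offset += len(line)
    return index


def lookup(path, index, key):
    """Fetch a single record by seeking directly to its offset."""
    with open(path, "rb") as f:
        f.seek(index[key])
        return f.readline().decode("utf-8").rstrip("\r\n")
```

Only the keys and offsets live in memory; the records themselves stay on disk until requested. If even the index is too large for memory, it has to become an on-disk structure itself.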
A specialisation of this is the use of memory-mapped files, where you let the operating system's memory manager handle the loading and caching of data.
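A minimal sketch of the memory-mapped variant, using Python's standard `mmap` and `struct` modules. It assumes a file of fixed-width records (one little-endian 64-bit integer each), which is a layout chosen purely for the example; `write_records` and `read_record` are hypothetical helpers.

```python
import mmap
import struct

RECORD = struct.Struct("<q")  # assumed layout: one signed 64-bit int per record


def write_records(path, values):
    """Write the values as fixed-width binary records."""
    with open(path, "wb") as f:
        for v in values:
            f.write(RECORD.pack(v))


def read_record(path, i):
    """Read record i through a read-only memory map.

    Only the pages that are actually touched get loaded; the operating
    system decides what to cache and what to evict.
    """
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return RECORD.unpack_from(mm, i * RECORD.size)[0]
```

Fixed-width records make the offset arithmetic trivial. Variable-length records would need an index like the one in the previous approach.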
A DBMS can greatly simplify data access, but does add some system overhead.
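To show what that simplification can look like, here is a small sketch using SQLite through Python's standard `sqlite3` module. The `events` table, its columns, and the helper names are made up for the example; any DBMS would do.

```python
import sqlite3


def build_db(conn, rows):
    """Load (user_id, amount) rows and index them by user_id."""
    conn.execute("CREATE TABLE events (user_id INTEGER, amount INTEGER)")
    conn.executemany("INSERT INTO events VALUES (?, ?)", rows)
    conn.execute("CREATE INDEX idx_events_user ON events (user_id)")
    conn.commit()


def total_for_user(conn, user_id):
    """Let the database locate and aggregate the matching rows."""
    row = conn.execute(
        "SELECT COALESCE(SUM(amount), 0) FROM events WHERE user_id = ?",
        (user_id,),
    ).fetchone()
    return row[0]


if __name__ == "__main__":
    # ":memory:" keeps the demo self-contained; a real large data set
    # would use a file path so the data lives on disk.
    conn = sqlite3.connect(":memory:")
    build_db(conn, ((i % 10, i) for i in range(100)))
    print(total_for_user(conn, 3))
    conn.close()
```

The indexing, paging, and caching decisions are left to the database engine, which is where both the convenience and the overhead come from.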
Upvotes: 0
Reputation: 6797
There is no silver bullet. More context is needed to determine which algorithms and data structures are useful for a given large-scale purpose. For data that is too large to fit in memory, for example, many database management systems use B+ trees.
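To give a feel for the structure, here is a simplified, in-memory sketch of a B+ tree built by bulk-loading already-sorted data. It is not how any particular DBMS implements it: a real system stores each node as a disk page and supports inserts and deletes with node splitting and merging, all of which is omitted here. The class and function names and the `fanout` parameter are illustrative.

```python
from bisect import bisect_left, bisect_right


class Leaf:
    def __init__(self, keys, values):
        self.keys = keys
        self.values = values
        self.next = None  # link to the right sibling, used for range scans


class Internal:
    def __init__(self, keys, children):
        self.keys = keys  # keys[i] is the smallest key under children[i + 1]
        self.children = children


def smallest_key(node):
    while isinstance(node, Internal):
        node = node.children[0]
    return node.keys[0]


def bulk_load(sorted_items, fanout=4):
    """Build the tree bottom-up from (key, value) pairs sorted by key."""
    leaves = []
    for i in range(0, len(sorted_items), fanout):
        chunk = sorted_items[i:i + fanout]
        leaves.append(Leaf([k for k, _ in chunk], [v for _, v in chunk]))
    if not leaves:
        leaves = [Leaf([], [])]
    for left, right in zip(leaves, leaves[1:]):
        left.next = right
    level = leaves
    while len(level) > 1:  # stack internal levels until one root remains
        parents = []
        for i in range(0, len(level), fanout):
            group = level[i:i + fanout]
            parents.append(Internal([smallest_key(c) for c in group[1:]], group))
        level = parents
    return level[0]


def find_leaf(root, key):
    """Descend from the root, visiting one node per level."""
    node = root
    while isinstance(node, Internal):
        node = node.children[bisect_right(node.keys, key)]
    return node


def search(root, key):
    leaf = find_leaf(root, key)
    i = bisect_left(leaf.keys, key)
    if i < len(leaf.keys) and leaf.keys[i] == key:
        return leaf.values[i]
    return None


def range_scan(root, lo, hi):
    """Return all (key, value) pairs with lo <= key <= hi, in key order."""
    out = []
    leaf = find_leaf(root, lo)
    while leaf is not None:
        for k, v in zip(leaf.keys, leaf.values):
            if k > hi:
                return out
            if k >= lo:
                out.append((k, v))
        leaf = leaf.next
    return out
```

The point of the high fanout in a real B+ tree is that each node fills a disk page, so a lookup touches only a handful of pages even for very large tables, and the linked leaves make range queries a sequential read.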
Upvotes: 1
Reputation: 882411
There is no single data structure or algorithm for "handling" large datasets of any nature whatsoever and for every possible purpose. There is, rather, a vast collection of architectures, data structures, and algorithms for the many varied kinds of data and of required "handling" (in single-task, SMP, and distributed environments, which may well require very different approaches in many cases).
Upvotes: 1