bhavs

Reputation: 2291

design a data structure to hold large amount of data

I was asked the following question at an interview, which I was unable to solve. Any pointers would be very helpful.

I have 100 files, each 10 MB in size. The content of each file is a set of string-to-integer mappings like:

string_key=integer value

 a=5
 ba=7
 cab=10

The physical RAM available is 25 MB. How would a data structure be designed such that:

1. For any duplicate string_key, the integer values are added together
2. The string_key=integer value pairs can be displayed in alphabetical order

Constraint :

All the entries of a file could be unique, i.e. the entire 100*10 MB of data could be unique string_key to integer mappings.

Solution 1 :

I was thinking about loading the files one after the other and storing the information in a hashmap, but this hashmap would be extremely large, and there is not enough RAM available to hold it if all of the files contain unique keys.
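
Concretely, the hashmap idea looks something like the sketch below (the file names and the key=value format are taken from the example above; everything else is illustrative). It meets both requirements, but only if the whole map fits in RAM, which the constraint rules out:

    from collections import defaultdict

    totals = defaultdict(int)
    for i in range(1, 101):
        with open("file" + str(i)) as f:
            for line in f:
                key, value = line.strip().split("=")
                totals[key] += int(value)      # requirement 1: sum duplicate keys

    for key in sorted(totals):                 # requirement 2: alphabetical output
        print(f"{key}={totals[key]}")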

Any other ideas ?

Using a NoSQL database is not an option.

Upvotes: 3

Views: 877

Answers (1)

chrbr

Reputation: 179

Here's my stab at it. Basically, the idea is to use a series of small binary trees to hold the sorted data, creating them and saving them to disk on the fly to stay within memory, plus a linked list to keep the trees themselves in order.

Hand-wavey version:

Create a binary tree sorted alphabetically on the keys of its entries. Each entry has a key and a value. Each tree stores, as attributes, its own name and its first and last keys. We load each file separately and insert its lines entry by entry into the tree, which sorts them automatically. When the contents of the tree reach 10 MB, we split the tree into two trees of 5 MB each and save them to disk. To keep track of our trees, we keep a list of the trees' names together with their first keys. From then on, for each line in a fileN, we use this list to locate the appropriate tree, load that tree into memory, and carry out the necessary operations. We continue this process until we have reached the end.
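
As a rough illustration, one such tree could look like the sketch below. The Node/Tree classes and the byte-size estimate are my own inventions, and a real implementation would want a balanced tree, since nearly sorted input degenerates a plain binary search tree into a linked list:

    class Node:
        def __init__(self, key, value):
            self.key, self.value = key, value
            self.left = self.right = None

    class Tree:
        def __init__(self, name):
            self.name = name          # arbitrary unique string, doubles as a filename
            self.root = None
            self.size = 0             # rough size in bytes
            self.first_key = None     # smallest key currently in the tree
            self.last_key = None      # largest key currently in the tree

        def add(self, key, value):
            """Insert key; if it already exists, add value to its total instead."""
            if self.first_key is None or key < self.first_key:
                self.first_key = key
            if self.last_key is None or key > self.last_key:
                self.last_key = key
            if self.root is None:
                self.root = Node(key, value)
                self.size += len(key) + 8        # crude estimate: key bytes + an int
                return
            node = self.root
            while True:
                if key == node.key:
                    node.value += value          # duplicate string_key: sum the integers
                    return
                side = "left" if key < node.key else "right"
                child = getattr(node, side)
                if child is None:
                    setattr(node, side, Node(key, value))
                    self.size += len(key) + 8
                    return
                node = child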

With this method, the amount of data in memory at any time stays under 25 MB: there is always one file being read (at most 10 MB), one tree loaded (at most 10 MB), and the list of trees (which hopefully will not exceed 5 MB).

Slightly more rigorous algorithm:

  1. Initialize a binary tree B whose entries are (key, value) tuples, kept sorted on the entries' key. B has attributes name, size, first_key, last_key, where name is some arbitrary unique string and size is its size in bytes.

  2. Initialize a sorted linked list L whose entries are tuples of the form (tree_name, first_key), sorted on the entries' first_key. This is our list of trees. Add the tuple (B.name, B.first_key) to L.

  3. Supposing our files are named file1, file2, ..., file100, we proceed with the following algorithm, written in a pseudo-code that happens to closely resemble Python. (I hope the undeclared functions are self-explanatory; a sketch of find_correct_tree follows below.)

    for i in [1..100]:
        f = open("file" + str(i))          # at most 10 MB in memory
        for line in f:
            (key, value) = separate_line(line)

            if key < B.first_key or key > B.last_key:
                save_to_disk(B)            # persist the current tree before switching
                B = find_correct_tree(L, key)

            if key.size + value.size + B.size > 10MB:
                (A, B) = B.split()         # A gets a random name, B keeps its own
                L.add((A.name, A.first_key))
                if key < B.first_key:      # the new entry belongs in the lower half
                    save_to_disk(B)
                    B = A                  # 5 MB out of memory
                else:
                    save_to_disk(A)

            B.add(key, value)              # sums the values on duplicate keys
    save_to_disk(B)
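
Of the undeclared helpers, find_correct_tree does the real work. Here is a sketch, assuming L is held as an ordinary Python list of (tree_name, first_key) tuples sorted by first_key, and that load_from_disk/save_to_disk handle (de)serialization, e.g. with pickle. Rebuilding first_keys on each call is linear in the number of trees, which is fine at roughly 200 trees of 5 MB:

    import bisect

    def find_correct_tree(L, key):
        # Pick the rightmost tree whose first_key <= key; a key smaller than
        # every first_key falls into the first tree.
        first_keys = [fk for (_, fk) in L]
        i = max(bisect.bisect_right(first_keys, key) - 1, 0)
        tree_name, _ = L[i]
        return load_from_disk(tree_name)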
    

Then we just iterate over the list and print out each associated tree:

    for (tree_name, _) in L:
        load_from_disk(tree_name).print_in_order()
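
With the Tree sketch from earlier, print_in_order could be an iterative in-order traversal (again my own sketch, not part of the original pseudo-code):

    def print_in_order(self):
        """Print key=value pairs in ascending key order."""
        stack, node = [], self.root
        while stack or node:
            while node:                    # walk down to the leftmost unvisited node
                stack.append(node)
                node = node.left
            node = stack.pop()
            print(f"{node.key}={node.value}")
            node = node.right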

This is somewhat incomplete; e.g. to make it work you'll have to update the list L every time a tree's first_key changes, and I haven't rigorously proved that this stays within 25 MB. But my intuition tells me it would likely work. There are probably also more efficient structures than a sorted linked list for indexing the trees (a balanced tree or a sorted array, say; a plain hashtable would lose the ordering we need for the final pass).

Upvotes: 2
