I came across a question. There is a sentence (say contains millions of words) and words are repetitive (e.g. Sentence is: "my name is ram, my name is shyam, ram is an engineer, shyam is also an engineer" and so on). Goal is: Memory should be efficiently used (word should be stored only once) The sentence must be re-construct-able from the data structure. Can someone suggest what data structure should be best suitable for this requirement.

Reputation: 96

What Data Structure should be used to store a large sentence (may be millions of words) with repetition of words

I came across a question. There is a sentence (say contains millions of words) and words are repetitive (e.g. Sentence is: "my name is ram, my name is shyam, ram is an engineer, shyam is also an engineer" and so on).

Goal is:

Memory should be efficiently used (word should be stored only once)
The sentence must be re-construct-able from the data structure.

Can someone suggest what data structure should be best suitable for this requirement.

Upvotes: 2

Answers (3)

Sahil

Reputation: 11

One possible method is by using these three data structures:
1) Vector of integer (say V1), to store index of word in V2
2) Vector of string (say V2), to store different words
3) Map with key=string and data=integer (say M), to check if current word is new or already occured and if already occured, it points to index in V2

Steps to store:
1) Extract a word from sentence
2) Check if this word already exist in map M
-> 2.1) if it exists, get the data value respective to this key and store in index
-> 2.2) if does not exist, insert at end of V2 and get it's index, and store word with index as a pair in M
3) Store index at end of V1
4) If sentence not ended goto 1)

Steps to re-construct:
Create empty string data structure to store sentence
1) Read integer from V1 and store in index
2) Append the string for V2 at index to string
3) Repeat for every integer in V1 in order from starting till end

Note: To take care of special characters like space, newline, etc. count them as separate words.

Upvotes: 0

j.i.h.

Reputation: 827

Since you can only store a word once, the natural solution is to store each unique word once and list the indices where each word occurs. I would use a hash table of some sort that maps strings to a list of unsigned integrals (the number of bytes in the data type depending on the length of the text), the integrals being the position of each word.

Upvotes: 0

oleksii

Reputation: 35905

If these are the only requirements, I would use a simple byte array.

The key is to pre-processing and post-processing. As a pre-processor I will use a good string compressor, such as GZIP or 7z. As you would expect, a post-processor will just decompress data.

Upvotes: 2

What Data Structure should be used to store a large sentence (may be millions of words) with repetition of words

Answers (3)

Related Questions