Reputation: 96
I came across a question. There is a sentence (say contains millions of words) and words are repetitive (e.g. Sentence is: "my name is ram, my name is shyam, ram is an engineer, shyam is also an engineer" and so on).
Goal is:
Can someone suggest what data structure should be best suitable for this requirement.
Upvotes: 2
Views: 765
Reputation: 11
One possible method is by using these three data structures:
1) Vector of integer (say V1), to store index of word in V2
2) Vector of string (say V2), to store different words
3) Map with key=string and data=integer (say M), to check if current word is new or already occured and if already occured, it points to index in V2
Steps to store:
1) Extract a word from sentence
2) Check if this word already exist in map M
-> 2.1) if it exists, get the data value respective to this key and store in index
-> 2.2) if does not exist, insert at end of V2 and get it's index, and store word with index as a pair in M
3) Store index at end of V1
4) If sentence not ended goto 1)
Steps to re-construct:
Create empty string data structure to store sentence
1) Read integer from V1 and store in index
2) Append the string for V2 at index to string
3) Repeat for every integer in V1 in order from starting till end
Note: To take care of special characters like space, newline, etc. count them as separate words.
Upvotes: 0
Reputation: 827
Since you can only store a word once, the natural solution is to store each unique word once and list the indices where each word occurs. I would use a hash table of some sort that maps strings to a list of unsigned integrals (the number of bytes in the data type depending on the length of the text), the integrals being the position of each word.
Upvotes: 0
Reputation: 35905
If these are the only requirements, I would use a simple byte array.
The key is to pre-processing and post-processing. As a pre-processor I will use a good string compressor, such as GZIP or 7z. As you would expect, a post-processor will just decompress data.
Upvotes: 2