Reputation: 44765
I have a file which contains concatenated strings.
find_or_add(string) either:
- returns the offset at which string already occurs somewhere in the file, or
- appends string to the file (reusing any overlap with the file's current end, as the examples below show) and returns the offset at which it now starts.
Pseudocode:
file.init() // file == ""
file.find_or_add("cat") // file == "cat", returns 0
file.find_or_add("able") // file == "catable", returns 3
file.find_or_add("table") // file == "catable", returns 2
file.find_or_add("tables") // file == "catables", returns 2
file.find_or_add("spigot") // file == "catablespigot", returns 7
file.find_or_add("pig") // file == "catablespigot", returns 8
What algorithm/structure should I be looking at to 'summarise' this file in memory, and allow the required operations in at most O(log N)?
Assume that the file is bigger than RAM.
Language is not important, but I can read Pseudocode, C, Java, Python, Javascript and Haskell.
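For concreteness, here is a naive in-memory Python sketch of the behaviour above. It reproduces the examples, including the overlap reuse in the "tables" and "spigot" calls, but it keeps the whole file in a string, so it is only a restatement of the spec, not a solution:

class ConcatFile:
    # Naive reference implementation: O(N) per call, everything in memory.
    def __init__(self):
        self.data = ""                          # file.init() // file == ""

    def find_or_add(self, s):
        pos = self.data.find(s)                 # already present? return its offset
        if pos != -1:
            return pos
        # Otherwise append, reusing the longest overlap with the current end.
        for k in range(min(len(self.data), len(s) - 1), -1, -1):
            if self.data.endswith(s[:k]):
                self.data += s[k:]
                return len(self.data) - len(s)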
Upvotes: 4
Views: 137
Reputation: 2015
If your inserts are small, you can build a suffix tree or suffix array (using a lazy implementation). If the inserts are shorter than some bound k, you only need to build the tree up to depth k, so the structure takes only limited memory.
Edit: if you have to store the suffix IDs (integers), then unfortunately the structure won't fit in memory if the text itself doesn't.
The suffix tree (or the suffix array, which is more compact) then represents all substrings of your text, and lookup is simple:
Is the substring in the tree?
Yes -> return the suffix offset (stored in the leaves of the tree).
No -> add it and append the text to your source file.
I am willing to go deeper into this, but I need to know the pattern sizes first.
EDIT: Note that an insert then only takes O(k) time!
EDIT2: If the patterns are not limited in length, you might have to build the full tree, which is O(N) in space and time; the problem is that you usually pay a factor of > 10 bytes/char then. Regards, irW
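A rough Python sketch of this depth-limited idea (MAX_DEPTH plays the role of the bound k; the in-memory text field and the '$id' bookkeeping are just illustrative choices, not the only way to store the suffix offsets):

MAX_DEPTH = 8                     # assumed bound k on pattern length

class DepthLimitedSuffixTrie:
    def __init__(self):
        self.root = {}            # char -> child dict; '$id' marks an occurrence offset
        self.text = ""            # stands in for the on-disk file

    def _index_suffixes(self, start):
        # (Re-)insert every suffix starting at position >= start, truncated to MAX_DEPTH.
        for i in range(start, len(self.text)):
            node = self.root
            for ch in self.text[i:i + MAX_DEPTH]:
                node = node.setdefault(ch, {})
                node.setdefault('$id', i)      # one offset where this path occurs

    def find_or_add(self, s):
        # Assumes 0 < len(s) <= MAX_DEPTH.
        node = self.root
        for ch in s:
            if ch not in node:
                break
            node = node[ch]
        else:
            return node['$id']                 # s already occurs at this offset
        old_len = len(self.text)
        # Reuse the longest overlap between the file's end and a prefix of s.
        k = next(k for k in range(min(old_len, len(s)), -1, -1)
                 if self.text.endswith(s[:k]))
        self.text += s[k:]
        # Only suffixes whose MAX_DEPTH window reaches the new characters need re-indexing.
        self._index_suffixes(max(0, old_len - MAX_DEPTH + 1))
        return old_len - k

Running the question's sequence (cat, able, table, tables, spigot, pig) through this sketch reproduces the offsets 0, 3, 2, 2, 7, 8.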
Upvotes: 1
Reputation: 2015
The suffix array and suffix tree will probably cause memory problems (they are always larger than the text, even if you cut them at a certain depth, since you need to store all suffix IDs in your structure).
You could instead create a set of files which hold the IDs (offsets) for certain prefixes. Say we store all length-2 prefixes in separate files and keep each one sorted: a file aa.txt, a file ab.txt, and so forth. Each file then contains on average 1/26^2 of the suffix IDs, kept sorted (a partitioned suffix array). Each time you want to do a lookup, you load the one small file for the query's prefix, which is already sorted, and check it. The complexity is O(N) (you have to load a file which is a constant, controllable fraction of your text), but you can tune the prefactor for performance. In a 5 GB file, for example, length-2 prefixes give you files of about 8 MB each; with prefix length 3 you are around 320 KB, and so forth.
Upvotes: 1
Reputation: 15159
Maybe this isn't applicable, but this technology and algorithm has O(log N) search and fast insertion, and is heavily optimized for efficient IO with large data sets. I could be wrong, but it feels like a good balance between insertion and search. What do you think?
Upvotes: 0