Reputation: 3930
I have two text files that are both about 1M lines. Let's call them f1 and f2.
For each line in f1, I need to find the index of line in f2, where the line in f1 is a substring of the line in f2. Since I need to do it for all lines of f1, using nested for loop is too time-costly, and I was wondering if there is a workaround that could significantly reduce the time.
Thank you in advance for the help.
Upvotes: 2
Views: 169
Reputation: 7209
There certainly are better ways than using two for loops :D That would give you an O(n^2) runtime. Something very useful for finding substrings is called a rolling hash. It is a way to use previous information to speed up finding substrings. It goes something like this:
Say I have string, f1 = "cat"
and a long string f2 = "There once was a cat named felix"
. What you do is define a "hash" based on the letters of your f1 string. The specifics on this can be found online in various sources but for this example lets simplify things and say that letters are assigned to numbers starting at 0 going up to 25 and we'll multiply each letter's value to form a decimal number with the number of digits equaling the lenght of the string:
hash("cat") = 10^2 * 2 + 10^1 * 0 + 10^0 * 19
= some value (in python the "hash" values of letters
are not 0 through 25 but given by using the ord cast:
ord("a") will give you 97)
Now this next part is super cool. We designate windows of the size of our f1 string, so size 3, and hash the f2 string in the same way we did with f1. You start with the first three letters. The hash doesn't match so we move on. If the hash matched, we make sure it's the same string (sometimes hashes equal each other but are not the same string because of the way we assign hashes but that's ok).
COOL PART** Instead of simply shifting the window and rehashing the 2 through 4th letters of f2, we "roll" the window and don't recalculate the entire hash (which if f1 is really long would be a waste of time) since the only letters changing are the first and last! The trick is to subtract the first hash value (in our example would be ord("t")*10^2), then multiply the entire number remaining by ten (because we moved everything to the left), and add the new hash letter, ord("r") * 10^0. Check for a match again and move on. If we match, return the index.
Why are we doing this: if you have a long enough f1 string, you reduce the runtime down to O(n*len(n)) so asymptotically linear!
Now, the actual implementation takes time and could get messy but there are plenty of sources online for this kind of answer. My algorithms class has great course notes online which help understand the theory a little better and there are tons of links with python implementations. Hope this helps!
Upvotes: 6