Which data structure and/or algorithm is suitable for this problem?

Question

I have a 30MB .txt file containing random strings like:

416
abcd23
cd542
banana
bambam

There is 1 word per line, words are separated by a new line

I need to search the file for my chosen substring and return every matched string in the file. To make it clearer:

Input: cd

Output: abcd23, cd542

Are Generalized suffix trees, suffix trees or suffix arrays suitable for this kind of problem or is there something faster? (time complexity is important)

p.s. my programming skills are a bit sketchy so any kind of example would be kindly appreciated

DarrylG · Accepted Answer

Assuming you are finding the strings in the file which contain one string, then the fastest method is simply to iterate through the file and check string function 'in' or 'find' on each line as follows.

def find_matches(filename, txt):
     with open(filename, 'r') as f:
         return [line for line in f if txt in line] # using 'in'

Example Usage:

matches = find_matches('myfile.txt', 'cd')

Simply reading the file avoids the overhead of structuring the fields of other methods such as Pandas -- Pandas one of the slower methods of reading in a file. Also: What is the fastest way to search a CSV file.

String methods using in, or find basically rely on an optimized fastsearch that's implemented in C whose efficiency per string search is:

It looks like the implementation is in worst case O(N*M) (The same as a naive approach), but can do O(N/M) in some cases (where N and M are the lengths of the string and substring respectively), and O(N) in frequent cases

Which data structure and/or algorithm is suitable for this problem?

Answers (1)

Related Questions