Finding the line numbers of all occurences of a string in a text file

Question

I'm trying to write a function that does the following:

Given a text file, I want to find all occurences of a certain string in this file; then, for each occurence, the line on which it was found should be added to a list. We assume that each line only contains at most one occurence. The text file can get very large, which means a simple for-loop to iterate over each line the file will be too slow.

For example, say we have a file with the content:

A B C D E F G
H J K L M N O
G F E D C B A
P Q R S T U V

If I were to search for "A", the function would find it on lines 1 and 3 and thus add 1 and 3 to a list (and then return the list).

I was considering binary search, but it seems to require that a list to be sorted and the elements to be distinct - I'm looking for identical values.

So, is any other search algorithm i can base my function on, with roughly the same performance as binary search?

Thanks!

Michael Goldshteyn · Accepted Answer

You can index your lines, if they change infrequently and you will be performing many searches on them. One way to index them would be to create a histogram of which characters are present in which lines (and how many times, perhaps). Then you can invert this and say that the letter A, for example, appears on lines 5, 10 and 20. If you are searching for "ABF", you can use the inverted histogram to determine which lines are candidates (i.e., contain the letters 'A', 'B' and 'F') and then only look at those lines.

Whether or not this is an effective strategy will depend on the selectivity of your searches and the length of the search strings, among other things. Only testing will reveal whether or not the algorithm has merit for your particular usage patterns.

Finding the line numbers of all occurences of a string in a text file

Answers (1)

Related Questions