user1068051
user1068051

Reputation: 217

Fast method to search for a string in a big text file with python

That's my current situation:

The last point is the problem. Actually I need to search for complete match AND for partial match of the string. The algorithm I wrote just involved the use of regular expressions combined with some attempts to make the process faster: for example, I hardcoded into my script the indexes of the dictionary that identified the singular letters of the alphabet, and then split the big text file fictionary into 26 smaller dictionary. That was totally useless, the script is still incredibly slow. Skimming some posts here, I was convinced to try mmap: but it looked useless to find all the partial matches, given a regular expression. Eventually I came to the conclusion that a trie may solve my problem, though i hardly know what is this. Should I go with tries? If so, how should I proceed to the creation of a trie in python? Is marisa-trie module good? Thanks to everybody

EDIT: By "partial match", I mean that I have the prefix of a string. I do not need matches at the end or in the middle, just at the beginning.

Upvotes: 1

Views: 7859

Answers (6)

arainchi
arainchi

Reputation: 1492

Easiest and fastest solution:

#!/usr/bin/env python

d = {}

# open your file here, i'm using /etc/hosts as an example...
f = open("/etc/hosts","r")
for line in f:
    line = line.rstrip()
    l = len(line)+1
    for i in xrange(1,l):
        d[line[:i]] = True
f.close()


while True:
    w = raw_input('> ')
    if not w:
        break

    if w in d:
        print "match found", w

Here is slightly more complex, but memory efficient one:

#!/usr/bin/env python

d = []

def binary_search(a, x, lo=0, hi=None):
    if hi is None:
        hi = len(a)
    while lo < hi:
        mid = (lo+hi)//2
        midval = a[mid]
        if midval < x:
            lo = mid+1
        elif midval > x:
            hi = mid
        else:
            return mid
    return -1


f = open("/etc/hosts","r")
for line in f:
    line=line.rstrip()
    l = len(line)+1
    for i in xrange(1,l):
        x = hash(line[:i])
        d.append(x)
f.close()

d.sort()

while True:
    w = raw_input('> ')
    if not w:
        break

    if binary_search(d, hash(w)) != -1:
        print "match found", w

Upvotes: 5

silverjam
silverjam

Reputation: 400

Using a trie still requires you to build a trie, which is O(n) to iterate the whole file-- taking advantage of the sorting will make it O(log_2 n). So, this faster solution would use a binary search (see below).

This solution still requires you to read in the entire file. In an even faster solution, you could pre-process the file and pad out all the lines so they're the same length (or build some kind of index structure in the file, to make seeking to the middle of the list feasible)-- then seeking to middle of the file would take you to the middle of the list. The "even faster" solution would probably only be needed for a really, really big file (gigabytes or hundreds of megabytes). You'd them combine this with the binary search.

Possibly, if the file-system supports sparse files-- doing the above padding scheme won't event increase the files actual blocks used on the disk.

Then, at that point, you're probably approaching a b-tree or a b+tree implementation to make the indexing efficient. So you could use a b-tree library.

Something like this:

import bisect

entries = ["a", "b", "c", "cc", "cd", "ce", "d", "e", "f" ]

def find_matches(ls, m):

    x = len(ls) / 2
    match_index = -1

    index = bisect.bisect_left(ls, m)
    matches = []

    while ls[index].startswith(m):
        matches.append(ls[index])
        index += 1

    return matches

print find_matches(entries, "c")

Output:

>>> ['c', 'cc', 'cd', 'ce']

Upvotes: 0

Roland
Roland

Reputation: 529

So to explain arainchi's very nice answer, make a dictionary with an entry for every line in your file. Then you can match your search string against the names of those entries. Dictionaries are really handy for this kind of searching.

Upvotes: 0

entropy
entropy

Reputation: 3144

Use a trie.

#dictionary is a list of words
def parse_dictionary(dictionary):
    dictionary_trie = {}
    for word in dictionary:
        tmp_trie = dictionary_trie
        for letter in word:
            if letter not in tmp_trie:
                tmp_trie[letter] = {}
            if 'words' not in tmp_trie[letter]:
                tmp_trie[letter]['words'] = []

            tmp_trie[letter]['words'].append(word)
            tmp_trie = tmp_trie[letter]
    return dictionary_trie

def matches(substring, trie):
    d = trie
    for letter in substring:
        try:
            d = d[letter]
        except KeyError:
            return []
    return d['words']

Usage example:

>>> import pprint
>>> dictionary = ['test', 'testing', 'hello', 'world', 'hai']
>>> trie = parse_dictionary(dictionary)
>>> pprint.pprint(trie)
{'h': {'a': {'i': {'words': ['hai']}, 'words': ['hai']},
       'e': {'l': {'l': {'o': {'words': ['hello']}, 'words': ['hello']},
                   'words': ['hello']},
             'words': ['hello']},
       'words': ['hello', 'hai']},
 't': {'e': {'s': {'t': {'i': {'n': {'g': {'words': ['testing']},
                                     'words': ['testing']},
                               'words': ['testing']},
                         'words': ['test', 'testing']},
                   'words': ['test', 'testing']},
             'words': ['test', 'testing']},
       'words': ['test', 'testing']},
 'w': {'o': {'r': {'l': {'d': {'words': ['world']}, 'words': ['world']},
                   'words': ['world']},
             'words': ['world']},
       'words': ['world']}}
>>> matches('h', trie)
['hello', 'hai']
>>> matches('he', trie)
['hello']
>>> matches('asd', trie)
[]
>>> matches('test', trie)
['test', 'testing']
>>> 

Upvotes: 1

Mark Ransom
Mark Ransom

Reputation: 308206

Since the file is already sorted and read in, you can use a binary search on it without needing to resort to any fancy data structures. Python has a binary search function built in, bisect.bisect_left`.

Upvotes: 2

James Thiele
James Thiele

Reputation: 413

You could make a list, let each line be one element of the list and do a binary search.

Upvotes: 0

Related Questions