Parand
Parand

Reputation: 106290

Python: Finding partial string matches in a large corpus of strings

I'm interested in implementing autocomplete in Python. For example, as the user types in a string, I'd like to show the subset of files on disk whose names start with that string.

What's an efficient algorithm for finding strings that match some condition in a large corpus (say a few hundred thousand strings)? Something like:

matches = [s for s in allfiles if s.startswith(input)]

I'd like to have the condition be flexible; eg. instead of a strict startswith, it'd be a match so long as all letters in input appears in s in the same order. What's better than the brute-force method I'm showing here?

Upvotes: 2

Views: 5605

Answers (5)

Brian
Brian

Reputation: 119221

For exact matching, generally the way to implement something like this is to store your corpus in a trie. The idea is that you store each letter as a node in the tree, linking to the next letter in a word. Finding the matches is simply walking the tree, and showing all children of your current location. eg. "cat", "cow" and "car" would be stored as:

  a--t
 / \ 
c   r
 \
  o--w

When you get a c, you start at the c node, an a will then take you to the c/a node (children "t" and "r", making cat and car as your completions).

Note that you'll also need to mark nodes that are complete words to handle names that are substrings of others (eg "car" and "cart")

To get the desired fuzzy matching, you may need to make some changes however.

Upvotes: 7

unbeknown
unbeknown

Reputation:

Maybe the readline module is what you are looking for. It is an interface to the GNU readline library Python Documentation. Maybe you can supply your own completition function with set_completer().

Upvotes: 1

James Hopkin
James Hopkin

Reputation: 13973

(addressing just the string matching part of the question)

If you want to try something quickly yourself, why not create a few dictionaries, each mapping initial patterns to lists of strings where

  • Each dictionary is keyed on initial patterns of a particular length
  • All the strings in a string list start with the initial pattern
  • An initial pattern/string list pair is only created if there are less than a certain number (say 10) of strings in the list

So, when the user has typed three characters, for example, you look in the dictionary with keys of length 3. If there is a match, it means you have between 1 and 10 possibilities immediately available.

Upvotes: 0

Philippe F
Philippe F

Reputation: 12165

The flexibility you want for matching your string is called Fuzzy Matching or Fuzzy Searching . I am not aware of any python implementation (but I haven't looked deeply in the subject) but there are C/C++ implementations that you can reuse, like the TRE packaged that supports regexp with fuzzy parameters.

Apart from that, there is always the question of whether the total list of your words fits in memory or not. If not, keeping them in a list is not feasible and will have to cache something to disk or to a database.

Upvotes: 0

Toni Ruža
Toni Ruža

Reputation: 7512

I used Lucene to autocomplete a text field with more then a hundred thousand possibilities and I perceived it as instantaneous.

Upvotes: 3

Related Questions