Finding names inside a string

Question

I'm trying to check an inputted string for names. The data I have available is every first and last name from Facebook.

What I want my program to do is take the input "johnsmith123" (for example) and return ['john', 'smith', '123']. If 'johns' and 'mith' are names in the list, I would want it to return ['john', 'smith', '123', 'johns', 'mith']. Basically: every possible combination from words in the list that can make up the entered phrase.

I know that regex tries are really, really fast for lookups. Using a tool called RegexFormat 7, I turned the wordlist into a 50mb regex trie.

Here is the code I am now trying to run, using that trie:

import io
import re

with io.open('REGEXES.rx.txt', encoding='latin-1') as myfile:
        TempRegex = myfile.read()

regex = re.compile(TempRegex)

while True == True:
    Password = input("Enter a phrase to be split: ")

    Words = re.findall(regex, Password)

    print(Words)

The program never reaches the input part. I am assuming it will take very long to compile such a large regex trie.

What I need to know is if there is some way to do this compilation process once, save the regular expression object to my disk, and just load the pre-compiled object to be used into the module instead of having to compile every time?

It is the compiling that is taking up so much time. I know that the search would actually happen quite quickly. If I can do the compilation process once, I can just run the compile overnight ...

If this is not feasible, what else can I do? The data I have available is a 100mb word list of every first and last name from Facebook, and a regex trie derived from that wordlist

Brendan Abel · Accepted Answer

I'm skeptical that a single massive regular expression is the best solution here. A single hash table of all possible first names might be faster.

all_first_names = set(['dan', 'bob', 'danny'])

username = 'dannysmith123'

# Get "extra" parts of the username if they exist
m = re.match(r'^([a-zA-Z]+)(.*)$', username)
name, extra = m.group(1), m.group(2)

# Get a list of possible first/last name splits
# [('d', 'annysmith'), ('da', 'nnysmith'), ...]
name_splits = [(name[:i], name[i:]) for i in range(1, len(name)+1)]

# Check each one of these splits to see if the first name
# is present in the master first name list, if so, add it to
# the list of valid matches.
match_list = []
for ns in name_splits:
    if ns[0] in all_first_names:
        match_list.extend(ns)
        if extra:
            match_list.append(extra)
            extra = None

Finding names inside a string

Answers (1)

Related Questions