Reputation:
I'm trying to check an inputted string for names. The data I have available is every first and last name from Facebook.
What I want my program to do is take the input "johnsmith123" (for example) and return ['john', 'smith', '123']. If 'johns' and 'mith' are names in the list, I would want it to return ['john', 'smith', '123', 'johns', 'mith']. Basically: every possible combination from words in the list that can make up the entered phrase.
I know that regex tries are really, really fast for lookups. Using a tool called RegexFormat 7, I turned the wordlist into a 50mb regex trie.
Here is the code I am now trying to run, using that trie:
import io
import re
with io.open('REGEXES.rx.txt', encoding='latin-1') as myfile:
TempRegex = myfile.read()
regex = re.compile(TempRegex)
while True == True:
Password = input("Enter a phrase to be split: ")
Words = re.findall(regex, Password)
print(Words)
The program never reaches the input part. I am assuming it will take very long to compile such a large regex trie.
What I need to know is if there is some way to do this compilation process once, save the regular expression object to my disk, and just load the pre-compiled object to be used into the module instead of having to compile every time?
It is the compiling that is taking up so much time. I know that the search would actually happen quite quickly. If I can do the compilation process once, I can just run the compile overnight ...
If this is not feasible, what else can I do? The data I have available is a 100mb word list of every first and last name from Facebook, and a regex trie derived from that wordlist
Upvotes: 0
Views: 299
Reputation: 37539
I'm skeptical that a single massive regular expression is the best solution here. A single hash table of all possible first names might be faster.
all_first_names = set(['dan', 'bob', 'danny'])
username = 'dannysmith123'
# Get "extra" parts of the username if they exist
m = re.match(r'^([a-zA-Z]+)(.*)$', username)
name, extra = m.group(1), m.group(2)
# Get a list of possible first/last name splits
# [('d', 'annysmith'), ('da', 'nnysmith'), ...]
name_splits = [(name[:i], name[i:]) for i in range(1, len(name)+1)]
# Check each one of these splits to see if the first name
# is present in the master first name list, if so, add it to
# the list of valid matches.
match_list = []
for ns in name_splits:
if ns[0] in all_first_names:
match_list.extend(ns)
if extra:
match_list.append(extra)
extra = None
Upvotes: 1