Reputation: 27078
Making a specific example:
The obvious thing is to do is use a radix tree to get a list of names for the given prefix. However, this doesn't take into account the frequency information. So, instead of just having the top 5 results be the first lexical results I would like the most common 5 names:
e.g. For the prefix dan
(5913, 'Daniel')
(889, 'Danny')
(820, 'Dana')
(272, 'Dan')
(60, 'Dane')
Is there a trie tree algorithm that I've missed? Of course the ideal implementation (if one exists) is in python in my mind.
UPDATE: Generally happy with what Paddy3113 has proposed, though I will say that it blows up completely when I feed it the 2.6GB file which is one of the files I'm reducing. Looking into the details the output gives some insight:
samz;Samzetta|Samzara|Samzie
samza;Samzara
samzar;Samzara
samzara;Samzara
samze;Samzetta
samzet;Samzetta
samzett;Samzetta
samzetta;Samzetta
samzi;Samzie
samzie;Samzie
# Format - PREFIX;"|".join(CHOICES).
We've got a few more days on the bounty side of things, so I'm still looking for the killer solution. Since it's not just about the reduction but also about the lookup side of things.
Upvotes: 4
Views: 1793
Reputation: 8247
If you want quick lookups, the only real solution is to pre-compute the answers for any given prefix. This is fine in the case that the data doesn't change, but you need a way to keep your load time small.
I'd suggest using a DBM to store the pre-computed dictionary. This is basically a dictionary where the contents are stored on disk and looked up as you reference items. See http://docs.python.org/library/anydbm.html for details. The only downside is that values must be strings, so you'd need to store a comma-separated list of the top 5 entries, say, and split it on lookup.
This will have a much quicker start time than pickle, as the DB doesn't need to be loaded. It's also a lot simpler than using sqlite.
Upvotes: 0
Reputation: 214
Yes, we can use a trie. The most frequent names for a trie node are either (1) the name at that trie node or (2) a most frequent name for a child of the trie node. Here's some Python code to play with.
from collections import defaultdict
class trie:
__slots__ = ('children', 'freq', 'name', 'top5')
def __init__(self):
self.children = defaultdict(trie)
self.freq = 0
self.name = None
self.top5 = []
def __getitem__(self, suffix):
node = self
for letter in suffix:
node = node.children[letter]
return node
def computetop5(self):
candidates = []
for letter, child in self.children.items():
child.computetop5()
candidates.extend(child.top5)
if self.name is not None:
candidates.append((self.freq, self.name))
candidates.sort(reverse=True)
self.top5 = candidates[:5]
def insert(self, freq, name):
node = self[name]
node.freq += freq
node.name = name
root = trie()
with open('letter_s.txt') as f:
for line in f:
freq, name = line.split(None, 1)
root.insert(int(freq.strip()), name.strip())
root.computetop5()
print(root['St'].top5)
Upvotes: 4
Reputation: 4772
Without any idea about tuning, I'd start by assuming I have a list of names and their frequencies then construct a dictionary mapping prefixes to a set of names with that prefix and then turn each set to a list of just the top 5 names w.r.t. frequency.
Using a list of just boys names derived from here massaged to create a text file where every line is an integer frequency of occurrence, some spaces, then a name like this:
8427 OLIVER
7031 JACK
6862 HARRY
5478 ALFIE
5410 CHARLIE
5307 THOMAS
5256 WILLIAM
5217 JOSHUA
4542 GEORGE
4351 JAMES
4330 DANIEL
4308 JACOB
...
The following code constructs the dictionary:
from collections import defaultdict
MAX_SUGGEST = 5
def gen_autosuggest(name_freq_file_name):
with open(name_freq_file_name) as f:
name2freq = {}
for nf in f:
freq, name = nf.split()
if name not in name2freq:
name2freq[name] = int(freq)
pre2suggest = defaultdict(list)
for name, freq in sorted(name2freq.items(), key=lambda x: -x[1]):
# in decreasing order of popularity
for i, _ in enumerate(name, 1):
prefix = name[:i]
pre2suggest[prefix].append((name, name2freq[name]))
# set max suggestions
return {pre:namefs[:MAX_SUGGEST]
for pre, namefs in pre2suggest.items()}
if __name__ == '__main__':
pre2suggest = gen_autosuggest('2010boysnames_popularity_engwales2.txt')
If you give the dict your prefix it will then return your suggestions (together with their frequencies in this case but those can be discarded if necessary:
>>> len(pre2suggest)
15303
>>> pre2suggest['OL']
[('OLIVER', 8427), ('OLLIE', 1130), ('OLLY', 556), ('OLIVIER', 175), ('OLIWIER', 103)]
>>> pre2suggest['OLI']
[('OLIVER', 8427), ('OLIVIER', 175), ('OLIWIER', 103), ('OLI', 23), ('OLIVER-JAMES', 16)]
>>>
Look no tries :-)
Time Killer
If it takes a long time to run then you might pre-compute the dict and save it to a file then load the pre-computed values when needed using the pickle module:
>>> import pickle
>>>
>>> savename = 'pre2suggest.pcl'
>>> with open(savename, 'wb') as f:
pickle.dump(pre2suggest, f)
>>> # restore it
>>> with open(savename, 'rb') as f:
p2s = pickle.load(f)
>>> p2s == pre2suggest
True
>>>
Upvotes: 2
Reputation: 122376
Here's an idea on how to do this:
Construct a string trie and store an integer with each node in the tree. This node indicates the number of names that use that node. So you would increment all the nodes of a name when that name is inserted into the trie.
You can then determine the top names by greedily selecting the names with the highest values.
Formally it would be the same as any string trie construction algorithm but with an added step of incrementing integers.
Upvotes: 0
Reputation: 4216
You could essentially augment a trie implementation to store it's children in popularity order instead of alphabetical, that being said you'd also have to store the popularity in each node of the trie.
Upvotes: 0