Reputation: 17349
I have a list of names (strings) divided into words. There are 8 million names, each name consists of up to 20 words (tokens). Number of unique tokens is 2.2 million. I need an efficient way to find all names containing at least one word from the query (which may contain also up to 20 words, but usually only a few).
My current approach uses Python pandas and looks like this (later referred to as original):
>>> df = pd.DataFrame([['foo', 'bar', 'joe'],
                       ['foo'],
                       ['bar', 'joe'],
                       ['zoo']],
                      index=['id1', 'id2', 'id3', 'id4'])
>>> df.index.rename('id', inplace=True)  # btw, is there a way to include this into prev line?
>>> print df
       0     1     2
id
id1  foo   bar   joe
id2  foo  None  None
id3  bar   joe  None
id4  zoo  None  None
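(Side note to the aside in that comment: it looks like passing a named pd.Index to the constructor names the index in one line.)

df = pd.DataFrame([['foo', 'bar', 'joe'],
                   ['foo'],
                   ['bar', 'joe'],
                   ['zoo']],
                  index=pd.Index(['id1', 'id2', 'id3', 'id4'], name='id'))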
def filter_by_tokens(df, tokens):
    # search within each column and then concatenate and dedup results
    results = [df.loc[lambda df: df[i].isin(tokens)] for i in range(df.shape[1])]
    return pd.concat(results).reset_index().drop_duplicates().set_index(df.index.name)

>>> print filter_by_tokens(df, ['foo', 'zoo'])
       0     1     2
id
id1  foo   bar   joe
id2  foo  None  None
id4  zoo  None  None
Currently such lookup (on the full dataset) takes 5.75s on my (rather powerful) machine. I'd like to speed it up at least, say, 10 times.
I was able to get it down to 5.29s by squeezing all columns into one and performing the lookup on that (later referred to as original, squeezed):
>>> df = pd.Series([{'foo', 'bar', 'joe'},
                    {'foo'},
                    {'bar', 'joe'},
                    {'zoo'}],
                   index=['id1', 'id2', 'id3', 'id4'])
>>> df.index.rename('id', inplace=True)
>>> print df
id
id1    {foo, bar, joe}
id2              {foo}
id3         {bar, joe}
id4              {zoo}
dtype: object
def filter_by_tokens(df, tokens):
    return df[df.map(lambda x: bool(x & set(tokens)))]

>>> print filter_by_tokens(df, ['foo', 'zoo'])
id
id1    {foo, bar, joe}
id2              {foo}
id4              {zoo}
dtype: object
But that's still not fast enough.
Another solution which seems easy to implement is Python multiprocessing (threading shouldn't help here because of the GIL, and there is no I/O, right?). But the problem with it is that the big dataframe needs to be copied to each process, which takes up all the memory. Another problem is that I need to call filter_by_tokens many times in a loop, so it would copy the dataframe on every call, which is inefficient.
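One idea I'm toying with for the copying issue (just a sketch, assuming the squeezed Series df above is built at module level before the pool is created, and a Unix fork start method): create the worker pool once, so the children inherit the Series via copy-on-write and only the tokens get pickled per call. Copy-on-write is imperfect for object-dtype data (refcount updates still touch pages), so memory may still creep up.

import multiprocessing as mp

import numpy as np
import pandas as pd

def _filter_chunk(args):
    # workers created by fork() see the parent's `df`; only `args` is pickled
    lo, hi, tokens = args
    chunk = df.iloc[lo:hi]
    return chunk[chunk.map(lambda x: bool(x & tokens))]

def filter_by_tokens_parallel(pool, tokens, nproc=12):
    tokens = set(tokens)
    bounds = np.linspace(0, len(df), nproc + 1).astype(int)
    parts = pool.map(_filter_chunk,
                     [(bounds[i], bounds[i + 1], tokens) for i in range(nproc)])
    return pd.concat(parts)

# create the pool once, after df exists, and reuse it for every query:
# pool = mp.Pool(12)
# print filter_by_tokens_parallel(pool, ['foo', 'zoo'])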
Note that words may occur many times in names (e.g. the most popular word occurs 600k times in names), so a reverse index would be huge.
What is a good way to write this efficiently? Python solution preferred, but I'm also open to other languages and technologies (e.g. databases).
UPD: I've measured execution time of my two solutions and the 5 solutions suggested by @piRSquared in his answer. Here are the results (tl;dr the best is 2x improvement):
+--------------------+----------------+
| method | best of 3, sec |
+--------------------+----------------+
| original | 5.75 |
| original, squeezed | 5.29 |
| zip | 2.54 |
| merge | 8.87 |
| mul+any | MemoryError |
| isin | IndexingError |
| query | 3.7 |
+--------------------+----------------+
mul+any gives MemoryError on d1 = pd.get_dummies(df.stack()).groupby(level=0).sum() (on a 128 GB RAM machine).

isin gives IndexingError: Unalignable boolean Series key provided on s[d1.isin({'zoo', 'foo'}).unstack().any(1)], apparently because the shape of df.stack().isin(set(tokens)).unstack() is slightly smaller than the shape of the original dataframe (8.39M vs 8.41M rows); I'm not sure why, or how to fix that.
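My best guess (unverified): df.stack() silently drops rows that are entirely None/NaN, so those ids disappear from the unstacked mask; reindexing the mask back onto the original index should make the shapes line up, something like:

tokens = {'zoo', 'foo'}
mask = df.stack().isin(tokens).unstack().any(1)
mask = mask.reindex(df.index, fill_value=False)  # restore the dropped ids
result = s[mask]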
Note that the machine I'm using has 12 cores (though I mentioned some problems with parallelization above). All of the solutions utilize a single core.
Conclusion (as of now): zip (2.54s) gives a 2.1x improvement over the original, squeezed solution (5.29s). That's good, though I was aiming for at least a 10x improvement if possible. So I'm leaving @piRSquared's (still great) answer unaccepted for now, to welcome more suggestions.
Upvotes: 11
Views: 524
Reputation: 1876
If you know that the number of unique tokens that you'll see is relatively small, you can pretty easily build an efficient bitmask to query for matches.
The naive approach (in the original version of this post) allows for up to 64 distinct tokens.
The improved code below uses the bitmask like a bloom filter (modular arithmetic in setting the bits wraps around 64). If there are more than 64 unique tokens, there will be some false positives, which the code below will automatically verify (using the original code).
Now the worst-case performance will degrade if the number of unique tokens is (much) larger than 64, or if you get particularly unlucky. Hashing could mitigate this.
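One reading of that hashing suggestion (my sketch, not part of the code below): derive the bit position from a stable hash of the token instead of from the first-seen id, so the way tokens collide doesn't depend on insertion order. The same function would then replace dii[t] % 64 on both the build and the query side.

import zlib

def token_bit(t):
    # stable 32-bit hash folded into one of the 64 bit positions
    return (zlib.crc32(t.encode('utf8')) & 0xffffffff) % 64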
As far as performance goes, using the benchmark data set below, I get:
Original Code: 4.67 seconds
Bitmask Code: 0.30 seconds
However, when the number of unique tokens is increased, the bitmask code remains efficient while the original code slows down considerably. With about 70 unique tokens, I get something like:
Original Code: ~15 seconds
Bitmask Code: 0.80 seconds
Note: for this latter case, building the bitmask array from the supplied list takes about as much time as building the dataframe. There's probably no real reason to build the dataframe; I left it in mainly for ease of comparison with the original code.
import time

import numpy as np
import pandas as pd


class WordLookerUpper(object):
    def __init__(self, token_lists):
        tic = time.time()
        self.df = pd.DataFrame(token_lists,
                               index=pd.Index(
                                   data=['id%d' % i for i in range(len(token_lists))],
                                   name='index'))
        print('took %d seconds to build dataframe' % (time.time() - tic))

        tic = time.time()
        dii = {}
        iid = 0
        self.bits = np.zeros(len(token_lists), np.uint64)
        for i in range(len(token_lists)):
            for t in token_lists[i]:
                if t not in dii:
                    dii[t] = iid
                    iid += 1
                # set the bit; note that b = dii[t] % 64
                # this 'wrap around' behavior lets us use this
                # bitmask as a probabilistic filter
                b = dii[t] % 64
                self.bits[i] |= np.uint64(1 << b)
        self.string_to_iid = dii
        print('took %d seconds to build bitmask' % (time.time() - tic))

    def filter_by_tokens(self, tokens, df=None):
        if df is None:
            df = self.df
        tic = time.time()
        # search within each column and then concatenate and dedup results
        results = [df.loc[lambda df: df[i].isin(tokens)] for i in range(df.shape[1])]
        results = pd.concat(results).reset_index().drop_duplicates().set_index('index')
        print('took %0.2f seconds to find %d matches using original code' % (
            time.time() - tic, len(results)))
        return results

    def filter_by_tokens_with_bitmask(self, search_tokens):
        tic = time.time()
        bitmask = np.zeros(len(self.bits), np.uint64)
        verify = np.zeros(len(self.bits), np.uint64)
        verification_needed = False
        for t in search_tokens:
            bit = np.uint64(1 << (self.string_to_iid[t] % 64))
            bitmask |= self.bits & bit
            if self.string_to_iid[t] >= 64:
                # this bit position is shared with other tokens, so matches must be verified
                verification_needed = True
                verify |= self.bits & bit
        if verification_needed:
            results = self.df[(bitmask > 0) & ~verify.astype(bool)]
            results = pd.concat([results,
                                 self.filter_by_tokens(search_tokens,
                                                       self.df[(bitmask > 0) & verify.astype(bool)])])
        else:
            results = self.df[bitmask > 0]
        print('took %0.2f seconds to find %d matches using bitmask code' % (
            time.time() - tic, len(results)))
        return results
Make some test data
unique_token_lists = [
    ['foo', 'bar', 'joe'],
    ['foo'],
    ['bar', 'joe'],
    ['zoo'],
    ['ziz', 'zaz', 'zuz'],
    ['joe'],
    ['joey', 'joe'],
    ['joey', 'joe', 'joe', 'shabadoo']
]

token_lists = []
for n in range(1000000):
    token_lists.extend(unique_token_lists)
Run the original code and the bitmask code
>>> wlook = WordLookerUpper(token_lists)
took 5 seconds to build dataframe
took 10 seconds to build bitmask
>>> wlook.filter_by_tokens(['foo','zoo']).tail(n=1)
took 4.67 seconds to find 3000000 matches using original code
id7999995 zoo None None None
>>> wlook.filter_by_tokens_with_bitmask(['foo','zoo']).tail(n=1)
took 0.30 seconds to find 3000000 matches using bitmask code
id7999995 zoo None None None
Upvotes: 0
Reputation: 11
Perhaps my first answer was a bit abstract; in the absence of real data, it generated random data at roughly the required volume to get a feel for the query time. This code is practical.
data = [['foo', 'bar', 'joe'],
        ['foo'],
        ['bar', 'joe'],
        ['zoo']]

wordlists = {}

print "build wordlists"
for x, d in enumerate(data):
    for word in d:
        try:
            wordlists[word].add(x)
        except:
            wordlists[word] = set([x])

print "query"
query = ["foo", "zoo"]
results = set()
for q in query:
    wordlist = wordlists.get(q)
    if wordlist:
        results = results.union(wordlist)

l = list(results)
l.sort()
for x in l:
    print data[x]
The cost in time and memory is in building the wordlists (inverted indices); the query is almost free. You have a 12-core machine, so presumably it has plenty of memory. For repeatability, build the wordlists, pickle each wordlist and write it to sqlite (or any key/value database) with the word as key and the pickled set as a binary blob. Then all you need is:
initialise_database()

query = ["foo", "zoo"]
results = set()
for q in query:
    wordlist = get_wordlist_from_database(q)  # get binary blob and unpickle
    if wordlist:
        results = results.union(wordlist)

l = list(results)
l.sort()
for x in l:
    print data[x]
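A minimal sketch of what those helpers could look like with the standard-library sqlite3 and pickle modules (the database path, table name and the extra build_database step are placeholders of mine):

import pickle
import sqlite3

DB_PATH = 'wordlists.db'
_conn = None

def build_database(wordlists):
    # one-time step: store each wordlist as a pickled blob keyed by the word
    conn = sqlite3.connect(DB_PATH)
    conn.execute('CREATE TABLE IF NOT EXISTS wordlists (word TEXT PRIMARY KEY, ids BLOB)')
    conn.executemany('INSERT OR REPLACE INTO wordlists VALUES (?, ?)',
                     [(w, sqlite3.Binary(pickle.dumps(ids, 2)))
                      for w, ids in wordlists.items()])
    conn.commit()
    conn.close()

def initialise_database():
    global _conn
    _conn = sqlite3.connect(DB_PATH)

def get_wordlist_from_database(word):
    # fetch the binary blob and unpickle it back into a set of record numbers
    row = _conn.execute('SELECT ids FROM wordlists WHERE word = ?', (word,)).fetchone()
    return pickle.loads(bytes(row[0])) if row else None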
Alternatively, use arrays, which are more memory efficient and probably faster for building the index. pypy is more than 10x faster than Python 2.7.
import array

data = [['foo', 'bar', 'joe'],
        ['foo'],
        ['bar', 'joe'],
        ['zoo']]

wordlists = {}

print "build wordlists"
for x, d in enumerate(data):
    for word in d:
        try:
            wordlists[word].append(x)
        except:
            wordlists[word] = array.array("i", [x])

print "query"
query = ["foo", "zoo"]
results = set()
for q in query:
    wordlist = wordlists.get(q)
    if wordlist:
        for i in wordlist:
            results.add(i)

l = list(results)
l.sort()
for x in l:
    print data[x]
Upvotes: 0
Reputation: 11
You can do it with a reverse index; run with pypy, the code below builds the index in 57 seconds, runs a query of 20 words in 0.00018 seconds and uses about 3.2 GB of memory. Python 2.7 builds the index in 158 seconds and runs the query in 0.0013 seconds using about 3.41 GB of memory.
The fastest possible way to do this is with bitmapped reversed indexes, compressed to save space.
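For illustration only, an uncompressed bitmapped variant might look like the sketch below (my own sketch; plain numpy boolean arrays standing in for real compressed bitmaps, and token_lists is assumed to hold one token list per record):

import numpy as np

def build_bitmaps(token_lists):
    # one boolean "bitmap" per word, with one slot per record
    n = len(token_lists)
    bitmaps = {}
    for recno, words in enumerate(token_lists):
        for w in words:
            if w not in bitmaps:
                bitmaps[w] = np.zeros(n, dtype=bool)
            bitmaps[w][recno] = True
    return bitmaps

def query_bitmaps(bitmaps, query_words, n):
    # OR together the bitmaps of the query words, return matching record numbers
    hit = np.zeros(n, dtype=bool)
    for w in query_words:
        if w in bitmaps:
            hit |= bitmaps[w]
    return np.flatnonzero(hit)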
"""
8m records with between 1 and 20 words each, selected at random from 100k words
Build dictionary of sets, keyed by word number, set contains nos of all records
with that word
query merges the sets for all query words
"""
import random
import time records = 8000000
words = 100000
wordlists = {}
print "build wordlists"
starttime = time.time()
wordlimit = words - 1
total_words = 0
for recno in range(records):
for x in range(random.randint(1,20)):
wordno = random.randint(0,wordlimit)
try:
wordlists[wordno].add(recno)
except:
wordlists[wordno] = set([recno])
total_words += 1
print "build time", time.time() - starttime, "total_words", total_words
querylist = set()
query = set()
for x in range(20):
while 1:
wordno = (random.randint(0,words))
if wordno in wordlists: # only query words that were used
if not wordno in query:
query.add(wordno)
break
print "query", query
starttime = time.time()
for wordno in query:
querylist.union(wordlists[wordno])
print "query time", time.time() - starttime
print "count = ", len(querylist)
for recno in querylist:
print "record", recno, "matches"
Upvotes: 1
Reputation: 294506
idea 0
zip
def pir(s, token):
    return s[[bool(p & token) for p in s]]
pir(s, {'foo', 'zoo'})
idea 1
merge
token = pd.DataFrame(dict(v=['foo', 'zoo']))
d1 = df.stack().reset_index('id', name='v')
s.ix[d1.merge(token).id.unique()]
idea 2
mul + any
d1 = pd.get_dummies(df.stack()).groupby(level=0).sum()
token = pd.Series(1, ['foo', 'zoo'])
s[d1.mul(token).any(1)]
idea 3
isin
d1 = df.stack()
s[d1.isin({'zoo', 'foo'}).unstack().any(1)]
idea 4
query
token = ('foo', 'zoo')
d1 = df.stack().to_frame('s')
s.ix[d1.query('s in @token').index.get_level_values(0).unique()]
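For reference, df in these snippets is the question's wide token frame and s is the squeezed Series of sets; my assumption is that it can be rebuilt from the frame like this:

# assumed setup: s maps each id to the set of its tokens
s = df.stack().groupby(level=0).apply(set)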
Upvotes: 4
Reputation: 5279
I have done similar things with the following tools:
HBase - keys can have multiple columns (very fast)
Elasticsearch - nice, easy to scale; you just need to import your data as JSON (see the sketch below)
Apache Lucene - will be very good for 8 million records
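For Elasticsearch, a rough sketch with the official Python client (the names index, the words field and its exact/keyword mapping are assumptions of mine, and the bulk format differs a bit between Elasticsearch versions):

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch()

# bulk-load the 8M names once; token_lists is assumed to hold the token lists
actions = ({'_index': 'names', '_id': 'id%d' % i,
            '_source': {'words': words}}
           for i, words in enumerate(token_lists))
helpers.bulk(es, actions)

# "at least one query word" becomes a terms query (an OR over exact terms)
hits = es.search(index='names',
                 body={'query': {'terms': {'words': ['foo', 'zoo']}}})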
Upvotes: 1