Dennis Golomazov

Reputation: 17349

Efficient lookup by common words

I have a list of names (strings) divided into words. There are 8 million names, and each name consists of up to 20 words (tokens). The number of unique tokens is 2.2 million. I need an efficient way to find all names containing at least one word from the query (which may also contain up to 20 words, but usually only a few).

My current approach uses Python Pandas and looks like this (referred to later as original):

>>> df = pd.DataFrame([['foo', 'bar', 'joe'], 
                       ['foo'], 
                       ['bar', 'joe'], 
                       ['zoo']], 
                      index=['id1', 'id2', 'id3', 'id4'])
>>> df.index.rename('id', inplace=True)  # btw, is there a way to include this into prev line?
>>> print df

       0     1     2
id                  
id1  foo   bar   joe
id2  foo  None  None
id3  bar   joe  None
id4  zoo  None  None

def filter_by_tokens(df, tokens):
    # search within each column and then concatenate and dedup results    
    results = [df.loc[lambda df: df[i].isin(tokens)] for i in range(df.shape[1])]
    return pd.concat(results).reset_index().drop_duplicates().set_index(df.index.name)

>>> print filter_by_tokens(df, ['foo', 'zoo'])

       0     1     2
id                  
id1  foo   bar   joe
id2  foo  None  None
id4  zoo  None  None    

Currently such a lookup (on the full dataset) takes 5.75s on my (rather powerful) machine. I'd like to speed it up at least, say, 10 times.

I was able to get down to 5.29s by squeezing all columns into one and performing the lookup on that (referred to later as original, squeezed):

>>> df = pd.Series([{'foo', 'bar', 'joe'}, 
                    {'foo'}, 
                    {'bar', 'joe'}, 
                    {'zoo'}], 
                    index=['id1', 'id2', 'id3', 'id4'])
>>> df.index.rename('id', inplace=True)
>>> print df

id
id1    {foo, bar, joe}
id2              {foo}
id3         {bar, joe}
id4              {zoo}
dtype: object

def filter_by_tokens(df, tokens):
    return df[df.map(lambda x: bool(x & set(tokens)))]

>>> print filter_by_tokens(df, ['foo', 'zoo'])

id
id1    {foo, bar, joe}
id2              {foo}
id4              {zoo}
dtype: object

But that's still not fast enough.

Another solution that seems easy to implement is Python multiprocessing (threading shouldn't help here because of the GIL, and there is no I/O, right?). But the problem with it is that the big dataframe needs to be copied to each process, which takes up all the memory. Another problem is that I need to call filter_by_tokens many times in a loop, so the dataframe would be copied on every call, which is inefficient.
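
One way around the per-call copying, sketched below, is to rely on a Unix fork-based multiprocessing start so that worker processes inherit the dataframe via copy-on-write instead of receiving a pickled copy. This is only a sketch under that assumption; the names SHARED_DF, _worker_filter and parallel_filter_by_tokens are illustrative, not from the question.

# Sketch only: create the pool after assigning the module-level global, so the
# forked workers inherit the dataframe instead of receiving a pickled copy.
# (Unix/fork only; CPython refcounting may still fault some pages per worker.)
import multiprocessing as mp

SHARED_DF = None  # set in the parent before the pool is created

def _worker_filter(args):
    col, tokens = args
    # each worker scans one column of the inherited dataframe and returns only ids
    return SHARED_DF.index[SHARED_DF[col].isin(tokens)].tolist()

def parallel_filter_by_tokens(df, tokens, processes=12):
    global SHARED_DF
    SHARED_DF = df
    pool = mp.Pool(processes)  # fork happens here, after SHARED_DF is set
    try:
        per_column = pool.map(_worker_filter, [(c, set(tokens)) for c in df.columns])
    finally:
        pool.close()
        pool.join()
    ids = set().union(*per_column)
    return df.loc[sorted(ids)]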

Note that words may occur many times in names (e.g. the most popular word occurs 600k times in names), so a reverse index would be huge.

What is a good way to write this efficiently? Python solution preferred, but I'm also open to other languages and technologies (e.g. databases).


UPD: I've measured the execution time of my two solutions and the 5 solutions suggested by @piRSquared in his answer. Here are the results (tl;dr the best is a 2x improvement):

+--------------------+----------------+
|       method       | best of 3, sec |
+--------------------+----------------+
| original           | 5.75           |
| original, squeezed | 5.29           |
| zip                | 2.54           |
| merge              | 8.87           |
| mul+any            | MemoryError    |
| isin               | IndexingError  |
| query              | 3.7            |
+--------------------+----------------+

mul+any gives MemoryError on d1 = pd.get_dummies(df.stack()).groupby(level=0).sum() (on a 128Gb RAM machine), presumably because get_dummies creates one column per unique token (2.2M columns).

isin gives IndexingError: Unalignable boolean Series key provided on s[d1.isin({'zoo', 'foo'}).unstack().any(1)], apparently because the shape of df.stack().isin(set(tokens)).unstack() is slightly smaller than the shape of the original dataframe (8.39M vs 8.41M rows); I'm not sure why, or how to fix that.
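
One possible fix for the isin variant, assuming the mismatch simply comes from rows being dropped on the stack/unstack round trip (I haven't verified this on the full dataset), is to realign the boolean mask to the original index before filtering; df and s are as in @piRSquared's answer below:

# sketch: reindex the mask so it has exactly one row per name;
# rows dropped by stack()/unstack() become all-False
mask = (df.stack()
          .isin({'zoo', 'foo'})
          .unstack()
          .reindex(df.index, fill_value=False)
          .any(axis=1))
result = s[mask]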

Note that the machine I'm using has 12 cores (though I mentioned some problems with parallelization above). All of the solutions utilize a single core.

Conclusion (as of now): there is a 2.1x improvement with zip (2.54s) over the original, squeezed solution (5.29s). That's good, though I aimed for at least a 10x improvement, if possible. So I'm leaving the (still great) answer by @piRSquared unaccepted for now, to welcome more suggestions.

Upvotes: 11

Views: 524

Answers (5)

jvd10

Reputation: 1876

If you know that the number of unique tokens that you'll see is relatively small, you can pretty easily build an efficient bitmask to query for matches.

The naive approach (one bit per distinct token in a 64-bit integer) allows for up to 64 distinct tokens.

The improved code below uses the bitmask like a Bloom filter (the modular arithmetic in setting the bits wraps around at 64). If there are more than 64 unique tokens, there will be some false positives, which the code below automatically verifies (using the original code).

Now the worst-case performance will degrade if the number of unique tokens is (much) larger than 64, or if you get particularly unlucky. Hashing could mitigate this.
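
For example, hashing might look like the following sketch (the helper name is mine, not from this answer): derive the bit position from a stable hash of the token rather than from its insertion-order id, so the bits stay spread out no matter how many unique tokens there are.

import hashlib

def token_bit(token, width=64):
    # stable across runs (unlike the built-in hash() with randomization enabled);
    # maps any token to one of `width` bit positions
    return int(hashlib.md5(token.encode('utf-8')).hexdigest(), 16) % width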

As far as performance goes, using the benchmark data set below, I get:

Original Code: 4.67 seconds

Bitmask Code: 0.30 seconds

However, when the number of unique tokens is increased, the bitmask code remains efficient while the original code slows down considerably. With about 70 unique tokens, I get something like:

Original Code: ~15 seconds

Bitmask Code: 0.80 seconds

Note: for this latter case, building the bitmask array from the supplied list takes about as much time as building the dataframe. There's probably no real reason to build the dataframe; I left it in mainly for ease of comparison with the original code.

import time
import numpy as np
import pandas as pd

class WordLookerUpper(object):
    def __init__(self, token_lists):
        tic = time.time()
        self.df = pd.DataFrame(token_lists,
                    index=pd.Index(
                        data=['id%d' % i for i in range(len(token_lists))],
                        name='index'))
        print('took %d seconds to build dataframe' % (time.time() - tic))
        tic = time.time()
        dii = {}
        iid = 0
        self.bits = np.zeros(len(token_lists), np.int64)
        for i in range(len(token_lists)):
            for t in token_lists[i]:
                if t not in dii:
                    dii[t] = iid
                    iid += 1
                # set the bit; note that b = dii[t] % 64
                # this 'wrap around' behavior lets us use this
                # bitmask as a probabilistic filter
                b = dii[t] % 64
                self.bits[i] |= (1 << b)
        self.string_to_iid = dii
        print('took %d seconds to build bitmask' % (time.time() - tic))

    def filter_by_tokens(self, tokens, df=None):
        if df is None:
            df = self.df
        tic = time.time()
        # search within each column and then concatenate and dedup results    
        results = [df.loc[lambda df: df[i].isin(tokens)] for i in range(df.shape[1])]
        results = pd.concat(results).reset_index().drop_duplicates().set_index('index')
        print('took %0.2f seconds to find %d matches using original code' % (
                time.time()-tic, len(results)))
        return results

    def filter_by_tokens_with_bitmask(self, search_tokens):
        tic = time.time()
        bitmask = np.zeros(len(self.bits), np.int64)
        verify = np.zeros(len(self.bits), np.int64)
        verification_needed = False
        for t in search_tokens:
            b = self.string_to_iid[t] % 64
            bitmask |= (self.bits & (1 << b))
            if self.string_to_iid[t] >= 64:
                verification_needed = True
                verify |= (self.bits & (1 << b))
        if verification_needed:
            results = self.df[(bitmask > 0) & ~verify.astype(bool)]
            results = pd.concat([results,
                                 self.filter_by_tokens(search_tokens,
                                    self.df[(bitmask > 0) & verify.astype(bool)])])
        else:
            results = self.df[bitmask > 0]
        print('took %0.2f seconds to find %d matches using bitmask code' % (
                time.time()-tic, len(results)))
        return results

Make some test data

unique_token_lists = [
    ['foo', 'bar', 'joe'], 
    ['foo'], 
    ['bar', 'joe'], 
    ['zoo'],
    ['ziz','zaz','zuz'],
    ['joe'],
    ['joey','joe'],
    ['joey','joe','joe','shabadoo']
]
token_lists = []
for n in range(1000000):
    token_lists.extend(unique_token_lists)

Run the original code and the bitmask code

>>> wlook = WordLookerUpper(token_lists)
took 5 seconds to build dataframe
took 10 seconds to build bitmask

>>> wlook.filter_by_tokens(['foo','zoo']).tail(n=1)
took 4.67 seconds to find 3000000 matches using original code
id7999995   zoo None    None    None

>>> wlook.filter_by_tokens_with_bitmask(['foo','zoo']).tail(n=1)
took 0.30 seconds to find 3000000 matches using bitmask code
id7999995   zoo None    None    None

Upvotes: 0

Martin Elliott-Hunter

Reputation: 11

Perhaps my first answer was a bit abstract; in the absence of real data, it generated random data at roughly the required volume to get a feel for the query time. The code below is practical.

data =[['foo', 'bar', 'joe'],
       ['foo'],
       ['bar', 'joe'],
       ['zoo']]

wordlists = {}
print "build wordlists"
for x, d in enumerate(data):
    for word in d:
        try:
            wordlists[word].add(x)
        except KeyError:
            wordlists[word] = set([x])
print "query"
query = [ "foo", "zoo" ]
results = set()
for q in query:
    wordlist = wordlists.get(q)
    if  wordlist:
        results = results.union(wordlist)
l = list(results)
l.sort()
for x in l:
    print data[x]

The cost in time and memory is in building the wordlists (inverted indices); the query is almost free. You have a 12-core machine, so presumably it has plenty of memory. For repeatability, build the wordlists, pickle each wordlist and write it to SQLite or any key/value database, with the word as the key and the pickled set as a binary blob (a sketch of such helpers follows the snippet below). Then all you need is:

initialise_database()
query = [ "foo", "zoo" ]
results = set()                             
for q in query:                             
    wordlist = get_wordlist_from_database(q) # get binary blob and unpickle
    if  wordlist:                        
        results = results.union(wordlist)
l = list(results)
l.sort()   
for x in l:      
    print data[x]
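
For illustration, here is a minimal sketch of those two helpers backed by SQLite; the table layout, function names and signatures are my assumptions, not something from this answer:

import pickle
import sqlite3

def initialise_database(path='wordlists.db'):
    conn = sqlite3.connect(path)
    conn.execute('CREATE TABLE IF NOT EXISTS wordlists (word TEXT PRIMARY KEY, blob BLOB)')
    return conn

def store_wordlist(conn, word, wordlist):
    # pickle the set and store it as a binary blob keyed by the word
    conn.execute('INSERT OR REPLACE INTO wordlists VALUES (?, ?)',
                 (word, sqlite3.Binary(pickle.dumps(wordlist, pickle.HIGHEST_PROTOCOL))))
    conn.commit()

def get_wordlist_from_database(conn, word):
    # returns the unpickled set, or None if the word was never indexed
    row = conn.execute('SELECT blob FROM wordlists WHERE word = ?', (word,)).fetchone()
    return pickle.loads(bytes(row[0])) if row else None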

Alternatively, use arrays, which are more memory-efficient and probably faster for building the index. PyPy is more than 10x faster than CPython 2.7.

import array

data =[['foo', 'bar', 'joe'],
       ['foo'],
       ['bar', 'joe'],
       ['zoo']]

wordlists = {}
print "build wordlists"
for x, d in enumerate(data):
    for word in d:
        try:
            wordlists[word].append(x)
        except KeyError:
            wordlists[word] = array.array("i", [x])
print "query"
query = [ "foo", "zoo" ]
results = set()
for q in query:
    wordlist = wordlists.get(q)
    if  wordlist:
        for i in wordlist:
            results.add(i)
l = list(results)
l.sort()
for x in l:
    print data[x]

Upvotes: 0

Martin Elliott-Hunter

Reputation: 11

You can do it with a reverse index; the code below, run in PyPy, builds the index in 57 seconds, answers a query of 20 words in 0.00018 seconds and uses about 3.2 GB of memory. Python 2.7 builds the index in 158 seconds and answers the query in 0.0013 seconds, using about 3.41 GB of memory.

The fastest possible way to do this is with bitmapped reverse indexes, compressed to save space (a rough sketch of the bitmapped idea follows the code below).

"""
8m records with between 1 and 20 words each, selected at random from 100k words
Build dictionary of sets, keyed by word number, set contains nos of all records
with that word
query merges the sets for all query words
"""
import random
import time

records = 8000000
words = 100000
wordlists = {}
print "build wordlists"
starttime = time.time()
wordlimit = words - 1
total_words = 0
for recno in range(records):
    for x in range(random.randint(1,20)):
        wordno = random.randint(0,wordlimit)
        try:
           wordlists[wordno].add(recno)
        except:
           wordlists[wordno] = set([recno])
        total_words += 1
print "build time", time.time() - starttime, "total_words", total_words
querylist = set()
query = set()
for x in range(20):
    while 1:
        wordno = (random.randint(0,words))
        if  wordno in wordlists: # only query words that were used
            if  not wordno in query:
                query.add(wordno)
                break
print "query", query
starttime = time.time()
for wordno in query:
    querylist |= wordlists[wordno]
print "query time", time.time() - starttime
print "count = ", len(querylist)
for recno in querylist:
    print "record", recno, "matches"

Upvotes: 1

piRSquared

Reputation: 294506
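
Note: these snippets appear to assume the data from the question; a minimal sketch of that setup, as I read it, with df as the wide token frame and s as the squeezed Series of sets:

import pandas as pd

df = pd.DataFrame([['foo', 'bar', 'joe'],
                   ['foo'],
                   ['bar', 'joe'],
                   ['zoo']],
                  index=pd.Index(['id1', 'id2', 'id3', 'id4'], name='id'))
s = df.stack().groupby(level=0).apply(set)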

idea 0
zip

def pir(s, token):
    return s[[bool(p & token) for p in s]]

pir(s, {'foo', 'zoo'})

idea 1
merge

token = pd.DataFrame(dict(v=['foo', 'zoo']))
d1 = df.stack().reset_index('id', name='v')
s.ix[d1.merge(token).id.unique()]

idea 2
mul + any

d1 = pd.get_dummies(df.stack()).groupby(level=0).sum()
token = pd.Series(1, ['foo', 'zoo'])
s[d1.mul(token).any(1)]

idea 3
isin

d1 = df.stack()
s[d1.isin({'zoo', 'foo'}).unstack().any(1)]

idea 4
query

token = ('foo', 'zoo')
d1 = df.stack().to_frame('s')
s.ix[d1.query('s in @token').index.get_level_values(0).unique()]

Upvotes: 4

Tim Seed

Reputation: 5279

I have done similar things with the following tools:

HBase - key can have multiple columns (very fast)

ElasticSearch - nice and easy to scale; you just need to import your data as JSON

Apache Lucene - will be very good for 8 million records
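
As a hedged illustration of the ElasticSearch route (the index name, field name and client calls are my assumptions in the older elasticsearch-py style, not something from this answer): index each name as a document carrying its token list, then a single terms query returns every name containing at least one query word.

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch()  # assumes a local node on the default port

def index_names(names):
    # names: iterable of (doc_id, token_list) pairs
    actions = ({'_index': 'names', '_id': doc_id, '_source': {'tokens': tokens}}
               for doc_id, tokens in names)
    helpers.bulk(es, actions)

def find_by_tokens(tokens):
    # 'terms' matches documents whose tokens field contains at least one of the values
    body = {'query': {'terms': {'tokens': list(tokens)}}, 'size': 10000}
    hits = es.search(index='names', body=body)['hits']['hits']
    return [h['_id'] for h in hits]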

Upvotes: 1
