alvas
alvas

Reputation: 122142

How to optimize the combination of 2 lists of tuples and remove their duplicates?

From here, How do I remove element from a list of tuple if the 2nd item in each tuple is a duplicate? , I am able to remove the duplicate of the 2nd element in a tuple from 1 lists of tuple.

Let's say I have 2 lists of tuples:

alist = [(0.7897897,'this is a foo bar sentence'),
(0.653234, 'this is a foo bar sentence'),
(0.353234, 'this is a foo bar sentence'),
(0.325345, 'this is not really a foo bar'),
(0.323234, 'this is a foo bar sentence'),]

blist = [(0.64637,'this is a foo bar sentence'),
(0.534234, 'i am going to foo bar this sentence'),
(0.453234, 'this is a foo bar sentence'),
(0.323445, 'this is not really a foo bar')]

And I need to combine the score if the 2nd elements are the same (score_from_alist * score_from_blist) and achieve the desired output:

clist = [(0.51,'this is a foo bar sentence'), # 0.51 = 0.789 * 0.646
(0.201, 'this is not really a foo bar')] # 0.201  = 0.325 * 0.323

Currently, I'm achieving the clist by doing this, but it's taking 5+ seconds when my alist and blist have around 5500+ tuples, where the 2nd element has around 20-40 words each. Is there any way to make the following function faster?

def overlapMatches(alist, blist):
    start_time = time.time()
    clist = []
    overlap = set()
    for d in alist:
        for dn in blist:
            if d[1] == dn[1]:
                score = d[0]*dn[0]
                overlap.add((score,d[1]))
    for s in sorted(overlap, reverse=True)[:20]:
        clist.append((s[0],s[1]))
    print "overlapping matches takes", time.time() - start_time 
    return clist

Upvotes: 2

Views: 246

Answers (2)

jfs
jfs

Reputation: 414585

To leave the tuple with the highest 1st item in presence of duplicates within a single list assuming it is sorted in descending order by the 1st item in a tuple, and to combine the score from the two lists if corresponding 2nd items in a tuple are the same:

# remove duplicates (take the 1st item among duplicates)
a, b = [{sentence: score for score, sentence in reversed(lst)}
        for lst in [alist, blist]]

# merge (leave only tuples that have common 2nd items (sentences))
clist = [(a[s]*b[s], s) for s in a.viewkeys() & b.viewkeys()]
clist.sort(reverse=True) # sort by (score, sentence) in descending order
print(clist)

Output:

[(0.510496368389, 'this is a foo bar sentence'),
 (0.10523121352499999, 'this is not really a foo bar')]

Upvotes: 1

NPE
NPE

Reputation: 500673

I would use dictionaries/sets to both eliminate duplicates and provide fast lookups:

alist = [(0.7897897,'this is a foo bar sentence'),
(0.653234, 'this is a foo bar sentence'),
(0.353234, 'this is a foo bar sentence'),
(0.325345, 'this is not really a foo bar'),
(0.323234, 'this is a foo bar sentence'),]

blist = [(0.64637,'this is a foo bar sentence'),
(0.534234, 'i am going to foo bar this sentence'),
(0.453234, 'this is a foo bar sentence'),
(0.323445, 'this is not really a foo bar')]

bdict = {k:v for v,k in reversed(blist)}
clist = []
cset = set()
for v,k in alist:
   if k not in cset:
      b = bdict.get(k, None)
      if b is not None:
        clist.append((v * b, k))
        cset.add(k)
print(clist)

Here, blist is used to eliminate all but the first appearance of each sentence, and to provide fast lookups by sentence.

If you don't care about the ordering of clist, you can simplify the structures somewhat:

bdict = {k:v for v,k in reversed(blist)}
cdict = {}
for v,k in alist:
   if k not in cdict:
      b = bdict.get(k, None)
      if b is not None:
        cdict[k] = v * b
print(list((k,v) for v,k in cdict.items()))

Upvotes: 3

Related Questions