Reputation: 122142
From here, How do I remove element from a list of tuple if the 2nd item in each tuple is a duplicate? , I am able to remove the duplicate of the 2nd element in a tuple from 1 lists of tuple.
Let's say I have 2 lists of tuples:
alist = [(0.7897897,'this is a foo bar sentence'),
(0.653234, 'this is a foo bar sentence'),
(0.353234, 'this is a foo bar sentence'),
(0.325345, 'this is not really a foo bar'),
(0.323234, 'this is a foo bar sentence'),]
blist = [(0.64637,'this is a foo bar sentence'),
(0.534234, 'i am going to foo bar this sentence'),
(0.453234, 'this is a foo bar sentence'),
(0.323445, 'this is not really a foo bar')]
And I need to combine the score if the 2nd elements are the same (score_from_alist * score_from_blist) and achieve the desired output:
clist = [(0.51,'this is a foo bar sentence'), # 0.51 = 0.789 * 0.646
(0.201, 'this is not really a foo bar')] # 0.201 = 0.325 * 0.323
Currently, I'm achieving the clist by doing this, but it's taking 5+ seconds when my alist and blist have around 5500+ tuples, where the 2nd element has around 20-40 words each. Is there any way to make the following function faster?
def overlapMatches(alist, blist):
start_time = time.time()
clist = []
overlap = set()
for d in alist:
for dn in blist:
if d[1] == dn[1]:
score = d[0]*dn[0]
overlap.add((score,d[1]))
for s in sorted(overlap, reverse=True)[:20]:
clist.append((s[0],s[1]))
print "overlapping matches takes", time.time() - start_time
return clist
Upvotes: 2
Views: 246
Reputation: 414585
To leave the tuple with the highest 1st item in presence of duplicates within a single list assuming it is sorted in descending order by the 1st item in a tuple, and to combine the score from the two lists if corresponding 2nd items in a tuple are the same:
# remove duplicates (take the 1st item among duplicates)
a, b = [{sentence: score for score, sentence in reversed(lst)}
for lst in [alist, blist]]
# merge (leave only tuples that have common 2nd items (sentences))
clist = [(a[s]*b[s], s) for s in a.viewkeys() & b.viewkeys()]
clist.sort(reverse=True) # sort by (score, sentence) in descending order
print(clist)
Output:
[(0.510496368389, 'this is a foo bar sentence'),
(0.10523121352499999, 'this is not really a foo bar')]
Upvotes: 1
Reputation: 500673
I would use dictionaries/sets to both eliminate duplicates and provide fast lookups:
alist = [(0.7897897,'this is a foo bar sentence'),
(0.653234, 'this is a foo bar sentence'),
(0.353234, 'this is a foo bar sentence'),
(0.325345, 'this is not really a foo bar'),
(0.323234, 'this is a foo bar sentence'),]
blist = [(0.64637,'this is a foo bar sentence'),
(0.534234, 'i am going to foo bar this sentence'),
(0.453234, 'this is a foo bar sentence'),
(0.323445, 'this is not really a foo bar')]
bdict = {k:v for v,k in reversed(blist)}
clist = []
cset = set()
for v,k in alist:
if k not in cset:
b = bdict.get(k, None)
if b is not None:
clist.append((v * b, k))
cset.add(k)
print(clist)
Here, blist
is used to eliminate all but the first appearance of each sentence, and to provide fast lookups by sentence.
If you don't care about the ordering of clist
, you can simplify the structures somewhat:
bdict = {k:v for v,k in reversed(blist)}
cdict = {}
for v,k in alist:
if k not in cdict:
b = bdict.get(k, None)
if b is not None:
cdict[k] = v * b
print(list((k,v) for v,k in cdict.items()))
Upvotes: 3