Reputation: 31
I have to find an efficient python code to do the following:
Find, at least one if existing, sequence of n
consecutive elements which is included in the two given lists.
For example, with n=3
, the result for these two lists will be ['Tom', 'Sam', 'Jill']
:
lst1 = ['John', 'Jim', 'Tom', 'Sam', 'Jill', 'Chris']
lst2 = ['Chris', 'John', 'Tom', 'Sam', 'Jill', 'Jim']
The sample code below does the trick, but it will take forever to do the same if I have to compare hundreds of thousands of rows/lists. Any advice on how to optimize the performance of this code for processing a very large amount of data will be greatly appreciated!
lst1 = ['John', 'Jim', 'Tom', 'Sam', 'Jill', 'Chris']
lst2 = ['Chris', 'John', 'Tom', 'Sam', 'Jill', 'Jim']
strNum = 3 #represents number of consecutive strings to search for
common_element_found = 'False'
common_elements = []
lst1length = len(lst1) - (strNum - 1)
lst2length = len(lst2) - (strNum - 1)
for x in range(lst1length):
ConsecutiveStringX = lst1[x] + ' ' + lst1[x + 1] + ' ' + lst1[x + 2]
for y in range(lst2length):
ConsecutiveStringY = lst2[y] + ' ' + lst2[y + 1] + ' ' + lst2[y + 2]
if ConsecutiveStringY == ConsecutiveStringX:
common_element_found = 'True'
common_elements.append(ConsecutiveStringY)
print('Match found: ' + str(common_elements))
break
if common_element_found == 'True':
common_element_found = 'False'
break
Upvotes: 3
Views: 557
Reputation: 79238
You can use regular expresions:
import re
re.search("((?:\w+ ){3}).*\\1"," ".join(lst1)+","+" ".join(lst2)).group(1)
'Tom Sam Jill'
Upvotes: 0
Reputation: 4151
n = 3
stringlst2 = '#'.join(lst2)
for ngram in [lst1[i:i+n] for i in range(len(lst1)-n+1)]:
if '#'.join(ngram) in stringlst2:
print(ngram)
break
Upvotes: 0
Reputation: 103884
You can use sets:
>>> {tuple(lst1[i:i+3]) for i in range(0,len(lst1)-2)} & {tuple(lst2[i:i+3]) for i in range(0,len(lst2)-2)}
{('Tom', 'Sam', 'Jill')}
Upvotes: 0
Reputation: 59274
IIUC,
consecs1 = [ tuple(lst1[i:i+3]) for i in range(0, len(lst1)-2)]
consecs2 = { tuple(lst2[i:i+3]) for i in range(0, len(lst2)-2)}
for c in consecs1:
if c in consecs2:
print(c)
Output
('Tom', 'Sam', 'Jill')
Explanation: you can make a list of tuples
for lst1
, which are hashable objects, and check if they are in
the set
of tuples
built from lst2
(which grants O(1) speed).
PS: Even though sets are unorded, order is guaranteed because the loop follows lst1
and not lst2
ordering.
Upvotes: 4