Reputation: 12837
I have a list of strings: ls = ['a','b','c']
and another one, with larger strings, guaranteed to include one and only one strings from ls
: ls2 = ['1298a', 'eebbbd', 'qcqcq321']"
.
How can I find, for a given string from ls2
, what is the index of the corresponding string from ls
?
I can use:
for s in ls:
for ss in ls2:
if s in ss:
print (s,ss,ls.index(s))
a 1298a 0
b eebbbd 1
c qcqcq321 2
but it there something nicer?
EDIT (hope it clarifies):
The actual case I'm working on has a bigger 1st list, and a smaller 2nd:
ls = ['apo','b','c','d25','egg','f','g']
ls2 = ['apoip21', 'oiujohuid25']
and I want to get the result 0,3
because the 1st item in ls2
has the 1st item from ls
, while the 2nd in ls2
has the 4th in ls
Upvotes: 0
Views: 84
Reputation: 114548
It doesn't look like you can get away from O(m * n * p)
complexity (where m = len(ls)
, n = len(ls2)
, p = max(map(len, ls2))
) without further information about your data. You can definitely reduce your current loop from O(m2 * n * p)
by keeping track of the current index using enumerate
. Also, don't forget about early termination:
for string in ls2:
for index, key in enumerate(ls):
if key in string:
print(key, string, index)
break
Notice that I swapped the inner and outer loop to make the break
work properly: you definitely want to check each element of ls2
, but only the minimum number of elements in ls
.
Here are some timings I accumulated on the different O(m * n * p)
solutions presented here. Thanks to @thierry-lathuille for the test data:
ls = ['g', 'j', 'z', 'a', 'rr', 'ttt', 'b', 'c', 'd', 'f']
ls2 = ['1298a', 'eebbb', 'qcqcq321', 'mlkjmd', 'dùmlk', 'lof',
'erreee', 'bmw', 'ottt', 'jllll', 'lla' ]
def with_table():
table = {key: index for index, key in enumerate(ls)}
result = {}
for string in ls2:
for key in ls:
if key in string:
result[string] = table[key]
return result
def with_enumerate():
result = {}
for string in ls2:
for index, key in enumerate(ls):
if key in string:
result[string] = index
break
return result
def with_dict_comp():
return {string: index for string in ls2 for index, key in enumerate(ls) if key in string}
def with_itertools():
result = {}
for (index, key), string in itertools.product(enumerate(ls), ls2):
if key in string:
result[string] = index
return result
%timeit with_table()
4.89 µs ± 61.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit with_enumerate()
5.27 µs ± 66.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit with_dict_comp()
6.9 µs ± 83.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit with_itertools()
17.5 ns ± 0.193 ns per loop (mean ± std. dev. of 7 runs, 100000000 loops each)
As it turns out, creating a lookup table for the indices is slightly faster than computing them on the fly with enumerate
.
Upvotes: 2
Reputation: 24289
You could use these two improvements:
ls = ['a','b','c']
ls2 = ['1298a', 'eebbbd', 'qcqcq321']
# preprocess ls to avoid calling ls.index each time:
indices = {ss:index for index, ss in enumerate(ls)}
for s in ls2:
for ss in ls:
if ss in s:
print(ss, s, indices[ss])
# as s is guaranteed to include only one of the substrings,
# we don't have to test the other substrings once we found a match
break
# a 1298a 0
# b eebbbd 1
# c qcqcq321 2
Some timings:
Breaking out of the loop once the match has been found always improves the speed. The overhead due to the creation of the dict of indices makes it slower for very small lists, but is already faster with lists shorter than the ones used in the timings:
ls = ['g', 'j', 'z', 'a', 'rr', 'ttt', 'b', 'c', 'd', 'f']
ls2 = ['1298a', 'eebbb', 'qcqcq321', 'mlkjmd', 'dùmlk', 'lof', 'erreee', 'bmw', 'ottt', 'jllll', 'lla' ]
def original():
for s in ls:
for ss in ls2:
if s in ss:
out = (s,ss,ls.index(s))
def with_break():
for s in ls2:
for ss in ls:
if ss in s:
out = (ss, s, ls.index(ss))
# as s is guaranteed to include only one of the substrings,
# we don't have to test the other substrings once we found a match
break
def with_break_and_dict():
# preprocess ls to avoid calling ls.index each time:
indices = {ss:index for index, ss in enumerate(ls)}
for s in ls2:
for ss in ls:
if ss in s:
out = (ss, s, indices[ss])
# as s is guaranteed to include only one of the substrings,
# we don't have to test the other substrings once we found a match
break
Timing results:
%timeit original()
%timeit with_break()
%timeit with_break_and_dict()
# 100000 loops, best of 3: 12.8 µs per loop
# 100000 loops, best of 3: 9.5 µs per loop
# 100000 loops, best of 3: 8.49 µs per loop
Upvotes: 0
Reputation: 5027
So, if I interpret the "nicer" request as "written in a more compact way", then I suggest all loops and conditions packed in a list comprehension:
>>> ls = ['apo','b','c','d25','egg','f','g']
>>> ls2 = ['apoip21', 'oiujohuid25']
>>> [ls.index(s) for s in ls for s2 in ls2 if s in s2]
[0, 3]
But it does not improve the complexity, if "nicer" was to be understood as "less complex"...
Upvotes: 0
Reputation: 866
Using a dict comprehension:
ls = ['apo','b','c','d25','egg','f','g']
ls2 = ['apoip21', 'oiujohuid25']
result = { string : index for index, i in enumerate(ls) for string in ls2 if i in string }
# {'apoip21': 0, 'oiujohuid25': 3}
Upvotes: 0
Reputation: 5414
Your codes time complexity is O(n^4), you can make it O(n^3) by using dict.
ls = ['a','b','c']
ls2 = ['1298a', 'eebbbd', 'qcqcq321']
word_dict=dict()
for i in range(len(ls)): #time complexity O(n)
word_dict[ls[i]]=i
for s in ls: #O(n)
for ss in ls2: #O(n)
if s in ss: #O(n)
print(s,ss,word_dict[s]) #O(1)
Upvotes: 0