Reputation: 4421

Indices of matching elements given two lists of which one has redundant entries

I've got two lists, a and b. For each element of a, I would like to know the index of the matching element in b. Every element in b is unique, whereas a contains duplicates.

a = [1993, 1993, 1994, 1995, 1996, 1996, 1998, 2003, 2005, 2005]
b = [1966, 1967, 1968, 1969, 1970, 1971, 1972, 1973, 1974, 1975, 1976, 1977, 1978, 1979, 1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987, 1988, 1989, 1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014]

Using the solution from Finding the indices of matching elements in list in Python:

matching = [match for match, element in enumerate(b) if element in a]

However, matching is only [27, 28, 29, 30, 32, 37, 39], whereas I expect it to be [27, 27, 28, 29, 30, 30, 32, 37, 39, 39].

Upvotes: 2

Answers (4)

RoadieRich

Reputation: 6556

This is an expansion on Padraic Cunningham's suggestion to use sets. If you instead convert the list you're indexing into (b) to a dictionary, you get O(1) lookup per element, at the cost of O(n) preprocessing:

a = [1993, 1993, 1994, 1995, 1996, 1996, 1998, 2003, 2005, 2005]
b = [1966, 1967, 1968, 1969, 1970, 1971, 1972, 1973, 1974, 1975, 1976, 1977, 1978, 1979, 1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987, 1988, 1989, 1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014]

d = {value: index for index, value in enumerate(b)}
print([d[x] for x in a])



>>> timeit("[bisect_left(b, x) for x in a]", "from __main__ import a, b; from bisect import bisect_left")
3.513558427607279
>>> timeit("[b.index(x) for x in a]", "from __main__ import a, b")
8.010070997323822
>>> timeit("d = {value: index for index, value in enumerate(b)}; [d[x] for x in a]", "from __main__ import a, b")
5.5277420695707065
>>> timeit("[d[x] for x in a]", "from __main__ import a, b; d = {value: index for index, value in enumerate(b)}")
1.1214096146165389

So, if you discount the preprocessing, the dictionary lookup is roughly 7 times faster than b.index in the actual processing (1.12 s versus 8.01 s), which pays off if you're matching many a lists against fewer b lists. Using bisect_left is faster if you're only doing it once and can guarantee that b is sorted in ascending order.
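
A minimal sketch of the "one b, many a lists" case described above. The helper names build_index and lookup_all are hypothetical (they are not from any library), and the sketch assumes every queried value is present in b:

```python
def build_index(values):
    """Map each value to its position; assumes values are unique and hashable."""
    return {value: index for index, value in enumerate(values)}


def lookup_all(index, queries):
    """Return the position of every query, duplicates included."""
    return [index[q] for q in queries]


b = list(range(1966, 2015))  # same contents as b in the question
index_of = build_index(b)    # O(len(b)) preprocessing, done once

# The same dictionary is reused for several query lists.
first = lookup_all(index_of, [1993, 1993, 1994, 1995, 1996, 1996, 1998, 2003, 2005, 2005])
second = lookup_all(index_of, [1966, 2014, 2014])
print(first)
print(second)
```

The preprocessing cost is paid once, so each additional query list only costs one dictionary lookup per element.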

Upvotes: 1

Padraic Cunningham

Reputation: 180391

If you have large lists, making b a set will make the membership test more efficient (the b.index call itself is still a linear scan):

st = set(b)
print([b.index(x) for x in a if x in st])

As your data is sorted, and presuming all elements from a are in b, you can also use bisect so that each index lookup is O(log n):

a = [1993, 1993, 1994, 1995, 1996, 1996, 1998, 2003, 2005, 2005]
b = [1966, 1967, 1968, 1969, 1970, 1971, 1972, 1973, 1974, 1975, 1976, 1977, 1978, 1979, 1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987, 1988, 1989, 1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014]


from bisect import bisect_left
print([bisect_left(b, x) for x in a])
[27, 27, 28, 29, 30, 30, 32, 37, 39, 39]
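
Note that bisect_left returns an insertion point, so it yields an index even when the value is not in b. If the assumption that every element of a is in b might not hold, a check along these lines could be added. This is only a sketch; sorted_index is a hypothetical helper name, and returning None for a missing value is an arbitrary choice:

```python
from bisect import bisect_left


def sorted_index(sorted_values, x):
    """Return the index of x in sorted_values, or None if x is absent."""
    i = bisect_left(sorted_values, x)
    # bisect_left only gives the insertion point, so confirm the value is there.
    if i < len(sorted_values) and sorted_values[i] == x:
        return i
    return None


b = list(range(1966, 2015))  # same contents as b in the question
result = [sorted_index(b, x) for x in [1993, 1993, 2005, 1950, 2020]]
print(result)
```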

On the small dataset it runs twice as fast as just indexing:

In [22]: timeit [bisect_left(b, x) for x in a]
100000 loops, best of 3: 4.2 µs per loop

In [23]: timeit [b.index(x) for x in a]
100000 loops, best of 3: 8.84 µs per loop

Another option would be to use a dict to store the indexes, which means the code runs in linear time: one pass over b to build the dict and one pass over a to look up each index:

# store all indexes as values and years as keys
indexes = {k: i for i, k in enumerate(b)}
# one pass over a accessing each index in constant time
print([indexes[x] for x in a])
[27, 27, 28, 29, 30, 30, 32, 37, 39, 39]
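
The lookup indexes[x] raises KeyError if a contains a value that is not in b. A hedged sketch of two ways to handle that case, using a made-up a that contains one missing year (1960); which behaviour is wanted depends on the use case:

```python
a = [1993, 1993, 1960, 2005]  # 1960 is deliberately absent from b
b = list(range(1966, 2015))   # same contents as b in the question

indexes = {k: i for i, k in enumerate(b)}

# Option 1: silently skip values that are not in b.
found = [indexes[x] for x in a if x in indexes]

# Option 2: keep the positions aligned with a, using None as a placeholder.
with_gaps = [indexes.get(x) for x in a]

print(found)
print(with_gaps)
```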

Even on the small input set this is a bit more efficient than indexing, and as a grows it becomes a lot more efficient:

In [34]: %%timeit                                                            
indexes = {k: i for i, k in enumerate(b)}
[indexes[x] for x in a]
   ....: 
100000 loops, best of 3: 7.54 µs per loop

In [39]: b = list(range(1966,2100))
In [40]: samp = list(range(1966,2100))
In [41]: a = [choice(samp) for _ in range(100)]

In [42]: timeit [b.index(x) for x in a]
10000 loops, best of 3: 154 µs per loop

In [43]: %%timeit
indexes = {k: i for i, k in enumerate(b)}
[indexes[x] for x in a]
   ....: 
10000 loops, best of 3: 22.5 µs per loop

Upvotes: 0

Paul Rooney

Reputation: 21609

What about:

print([b.index(i) for i in a if i in b])

Upvotes: 1

Lennart Regebro

Reputation: 172229

>>> a = [1993, 1993, 1994, 1995, 1996, 1996, 1998, 2003, 2005, 2005]
>>> b = [1966, 1967, 1968, 1969, 1970, 1971, 1972, 1973, 1974, 1975, 1976, 1977, 1978, 1979, 1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987, 1988, 1989, 1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014]
>>> [b.index(x) for x in a]
[27, 27, 28, 29, 30, 30, 32, 37, 39, 39]

Upvotes: 6
