Reputation: 745
What is the easiest and fastest way to check whether each element of a Series exists in a list of lists? For example, I have a Series and a list of lists as shown below. I have a loop that does exactly this, but it is a bit slow, so I want a faster way to do it.
groups = []
for desc in descs:
    for i in range(len(list_of_list)):
        if desc in list_of_list[i]:
            groups.append(i)
list_of_list = [['rfnd sms chrgs'],
                ['loan payment receipt'],
                ['zen june2018 aksg sal 1231552',
                 'zen july2018 aksg sal 1411191',
                 'zen aug2018 aksg mda sal 16014'],
                ['cshw agnes john udo mrs ',
                 'cshw agnes john udo',
                 'cshw agnes udo',
                 'cshw agnes john'],
                ['sms alert charge outstanding'],
                ['maint fee recovery jul 2018', 'vat maint fee recovery jul 2018'],
                ['sept2018 aksg mda sal 20028',
                 'oct2018 aksg mda sal 21929',
                 'nov2018 aksg mda sal 25229'],
                ['sms alert charges 28th sep 26th oct 2018']]
descs =
1959 rfnd sms chrgs
1960 loan payment receipt
1961 zen june2018 aksg sal 1231552
1962 loan payment receipt
1963 cshw agnes john udo mrs
1964 maint fee frm 31 may 2018 28 jun 2018
1965 vat maint fee frm 31 may 2018 28 jun 2018
1966 sms alert charge outstanding
1967 loan payment receipt
1968 zen july2018 aksg sal 1411191
1969 loan payment receipt
The expected output is a list of numbers,
e.g. [1, 2, 3, 4, 5, 6]
Upvotes: 1
Views: 204
Reputation: 120409
Prepare your data:
import pandas as pd

# merging a Series that has no name is not allowed
descs = descs.rename("descs")
# flatten the list of lists into one row per string, keeping each sublist's position
ll = pd.Series(list_of_list).explode().reset_index()
ll.columns = ["pos", "descs"]
>>> descs
1959 rfnd sms chrgs
1960 loan payment receipt
1961 zen june2018 aksg sal 1231552
1962 loan payment receipt
1963 cshw agnes john udo mrs
1964 maint fee frm 31 may 2018 28 jun 2018
1965 vat maint fee frm 31 may 2018 28 jun 2018
1966 sms alert charge outstanding
1967 loan payment receipt
1968 zen july2018 aksg sal 1411191
1969 loan payment receipt
Name: descs, dtype: object
>>> ll
pos descs
0 0 rfnd sms chrgs
1 1 loan payment receipt
2 2 zen june2018 aksg sal 1231552
3 2 zen july2018 aksg sal 1411191
4 2 zen aug2018 aksg mda sal 16014
5 3 cshw agnes john udo mrs
6 3 cshw agnes john udo
7 3 cshw agnes udo
8 3 cshw agnes john
9 4 sms alert charge outstanding
10 5 maint fee recovery jul 2018
11 5 vat maint fee recovery jul 2018
12 6 sept2018 aksg mda sal 20028
13 6 oct2018 aksg mda sal 21929
14 6 nov2018 aksg mda sal 25229
15 7 sms alert charges 28th sep 26th oct 2018
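To make the flattening step easier to follow, here is a minimal sketch on a tiny made-up input. The names `small` and `lookup` and the letter values are illustrative assumptions, not data from the question.

```python
import pandas as pd

# Hypothetical miniature input, not the data from the question
small = [["a"], ["b", "c"], ["d"]]

# explode() gives every inner string its own row and repeats the outer
# position in the index; reset_index() turns that index into a column
lookup = pd.Series(small).explode().reset_index()
lookup.columns = ["pos", "descs"]

print(lookup)
```

Each row of `lookup` pairs a string with the position of the sublist it came from, which is the shape the merge needs.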
Now you can merge descs and ll to get your list of numbers:
df = pd.merge(descs, ll, on="descs", how="left").set_index(descs.index)
>>> df
descs pos
1959 rfnd sms chrgs 0.0
1960 loan payment receipt 1.0
1961 zen june2018 aksg sal 1231552 2.0
1962 loan payment receipt 1.0
1963 cshw agnes john udo mrs 3.0
1964 maint fee frm 31 may 2018 28 jun 2018 NaN
1965 vat maint fee frm 31 may 2018 28 jun 2018 NaN
1966 sms alert charge outstanding 4.0
1967 loan payment receipt 1.0
1968 zen july2018 aksg sal 1411191 2.0
1969 loan payment receipt 1.0
Check:
>>> df.loc[1966, "descs"]
'sms alert charge outstanding'
>>> list_of_list[int(df.loc[1966, "pos"])]
['sms alert charge outstanding']
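Since the question asks for a plain list of numbers, the `pos` column still has to be converted, and the NaN rows need a decision. The sketch below shows two possible conventions on a small stand-in frame; the frame contents and the `-1` sentinel are assumptions chosen for illustration, not part of the original answer.

```python
import pandas as pd

# Stand-in for the merged frame (hypothetical values, same column names)
df = pd.DataFrame(
    {"descs": ["x", "y", "z"], "pos": [0.0, float("nan"), 2.0]},
    index=[1959, 1960, 1961],
)

# Option 1: keep only the rows that matched, like the original loop does
groups_matched = df["pos"].dropna().astype(int).tolist()

# Option 2: keep one entry per row and mark unmatched rows with -1
groups_all = df["pos"].fillna(-1).astype(int).tolist()

print(groups_matched)
print(groups_all)
```

Option 1 mirrors the loop in the question, which skips descriptions that are not found; option 2 keeps the result aligned with `descs` row by row.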
Another method:
This method takes advantage of the categorical data type, which could make the lookup faster.
>>> ll = pd.Series(list_of_list).explode()
>>> descs.astype("category").map(pd.Series(ll.index, index=ll.astype("category")))
1959 0.0
1960 1.0
1961 2.0
1962 1.0
1963 3.0
1964 NaN
1965 NaN
1966 4.0
1967 1.0
1968 2.0
1969 1.0
dtype: float64
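For comparison, a similar lookup can be sketched with a plain dictionary and `Series.map`. This is a different technique from the categorical one above, shown only as an assumption-laden alternative: the data is a shortened, partly made-up sample, `position_of` is a hypothetical name, and no claim is made about which variant is faster on real data.

```python
import pandas as pd

# Shortened sample; "unknown text" is a made-up unmatched description
list_of_list = [
    ["rfnd sms chrgs"],
    ["loan payment receipt"],
    ["zen june2018 aksg sal 1231552", "zen july2018 aksg sal 1411191"],
]
descs = pd.Series(
    ["rfnd sms chrgs", "loan payment receipt", "unknown text",
     "zen july2018 aksg sal 1411191"],
    index=[1959, 1960, 1961, 1962],
    name="descs",
)

# Build the string -> position table once; setdefault keeps the first
# position if the same string appears in more than one sublist
position_of = {}
for pos, sublist in enumerate(list_of_list):
    for text in sublist:
        position_of.setdefault(text, pos)

# Strings missing from the dictionary map to NaN
groups = descs.map(position_of)
print(groups)
```

Unlike the loop in the question, this returns one value per description (NaN when nothing matches) and records at most one position per string.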
Upvotes: 2