chuky pedro

Reputation: 745

Faster way to check whether an element of a pandas Series exists in a list of lists

What is the easiest and fastest way to check whether each element of a Series exists in a list of lists? For example, I have the Series and the list of lists shown below. I have a loop that does exactly this, but it is a bit slow, so I want a faster way to do it.

groups = []
for desc in descs: 
    for i in range(len(list_of_list)):
        if desc in list_of_list[i]:
            groups.append(i)

list_of_list = [['rfnd sms chrgs'],
 ['loan payment receipt'],
 ['zen june2018 aksg sal 1231552',
  'zen july2018 aksg sal 1411191',
  'zen aug2018 aksg mda sal 16014'],
 ['cshw agnes john udo mrs ',
  'cshw agnes john udo',
  'cshw agnes udo',
  'cshw agnes john'],
 ['sms alert charge outstanding'],
 ['maint fee recovery jul 2018', 'vat maint fee recovery jul 2018'],
 ['sept2018 aksg mda sal 20028',
  'oct2018 aksg mda sal 21929',
  'nov2018 aksg mda sal 25229'],
 ['sms alert charges 28th sep 26th oct 2018']]

descs = 

1959                            rfnd sms chrgs
1960        loan payment receipt
1961                zen june2018 aksg sal 1231552
1962        loan payment receipt
1963                     cshw agnes john udo mrs 
1964        maint fee frm 31 may 2018 28 jun 2018
1965    vat maint fee frm 31 may 2018 28 jun 2018
1966                 sms alert charge outstanding
1967        loan payment receipt
1968                zen july2018 aksg sal 1411191
1969        loan payment receipt

The expected output is a list of numbers (the index of the sublist each description belongs to),

e.g. [1, 2, 3, 4, 5, 6]

Upvotes: 1

Views: 204

Answers (1)

Corralien

Reputation: 120409

Prepare your data:

import pandas as pd

# merging a Series without a name is not allowed
descs = descs.rename("descs")

# convert list of lists to a series
ll = pd.Series(list_of_list).explode().reset_index()
ll.columns = ["pos", "descs"]
>>> descs
1959                           rfnd sms chrgs
1960                     loan payment receipt
1961            zen june2018 aksg sal 1231552
1962                     loan payment receipt
1963                 cshw agnes john udo mrs
1964    maint fee frm 31 may 2018 28 jun 2018
1965    maint fee frm 31 may 2018 28 jun 2018
1966             sms alert charge outstanding
1967                     loan payment receipt
1968            zen july2018 aksg sal 1411191
1969                     loan payment receipt
Name: descs, dtype: object

>>> ll
    pos                                     descs
0     0                            rfnd sms chrgs
1     1                      loan payment receipt
2     2             zen june2018 aksg sal 1231552
3     2             zen july2018 aksg sal 1411191
4     2            zen aug2018 aksg mda sal 16014
5     3                  cshw agnes john udo mrs
6     3                       cshw agnes john udo
7     3                            cshw agnes udo
8     3                           cshw agnes john
9     4              sms alert charge outstanding
10    5               maint fee recovery jul 2018
11    5           vat maint fee recovery jul 2018
12    6               sept2018 aksg mda sal 20028
13    6                oct2018 aksg mda sal 21929
14    6                nov2018 aksg mda sal 25229
15    7  sms alert charges 28th sep 26th oct 2018

Now you can merge descs and ll to get your list of numbers:

df = pd.merge(descs, ll, on="descs", how="left").set_index(descs.index)
>>> df
                                      descs  pos
1959                         rfnd sms chrgs  0.0
1960                   loan payment receipt  1.0
1961          zen june2018 aksg sal 1231552  2.0
1962                   loan payment receipt  1.0
1963               cshw agnes john udo mrs   3.0
1964  maint fee frm 31 may 2018 28 jun 2018  NaN
1965  maint fee frm 31 may 2018 28 jun 2018  NaN
1966           sms alert charge outstanding  4.0
1967                   loan payment receipt  1.0
1968          zen july2018 aksg sal 1411191  2.0
1969                   loan payment receipt  1.0
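
One caveat worth noting: if the same description appears in more than one sublist, the left merge returns an extra row per match, and set_index(descs.index) then fails with a length mismatch. Here is a minimal sketch of one way around it, assuming you only want the first group each description belongs to. The toy data and the drop_duplicates step are my own additions for illustration, not part of your data.

```python
import pandas as pd

# Toy data (hypothetical): "loan payment receipt" sits in two sublists.
list_of_list = [["rfnd sms chrgs"],
                ["loan payment receipt"],
                ["loan payment receipt", "sms alert charge outstanding"]]
descs = pd.Series(["loan payment receipt", "rfnd sms chrgs", "unknown text"],
                  index=[1959, 1960, 1961], name="descs")

# Same preparation as above: one row per (group position, description).
ll = pd.Series(list_of_list).explode().reset_index()
ll.columns = ["pos", "descs"]

# Keep only the first group for each description so the merge stays
# one row per element of descs.
ll_unique = ll.drop_duplicates(subset="descs", keep="first")

df = pd.merge(descs, ll_unique, on="descs", how="left").set_index(descs.index)
```

If you need every matching group rather than the first one, skip the drop_duplicates step and keep the merged rows without resetting the index.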

Check:

>>> df.loc[1966, "descs"]
'sms alert charge outstanding'

>>> list_of_list[int(df.loc[1966, "pos"])]
['sms alert charge outstanding']
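
Since you asked for a list of numbers, note that pos comes back as float because unmatched rows are NaN. A small sketch of two ways to get integers out, assuming a shortened version of your data (the "no such description" row is made up to show the unmatched case):

```python
import pandas as pd

# Shortened, partly hypothetical sample data.
list_of_list = [["rfnd sms chrgs"],
                ["loan payment receipt"],
                ["zen june2018 aksg sal 1231552", "zen july2018 aksg sal 1411191"]]
descs = pd.Series(["rfnd sms chrgs", "loan payment receipt",
                   "no such description", "zen july2018 aksg sal 1411191"],
                  index=[1959, 1960, 1961, 1962], name="descs")

ll = pd.Series(list_of_list).explode().reset_index()
ll.columns = ["pos", "descs"]
df = pd.merge(descs, ll, on="descs", how="left").set_index(descs.index)

# Option 1: keep unmatched rows, using the nullable integer dtype (NaN becomes <NA>).
groups = df["pos"].astype("Int64").tolist()

# Option 2: drop unmatched rows, which is closest to what the original loop collects.
matched = df["pos"].dropna().astype(int).tolist()
```

Option 2 gives a plain list of Python ints; option 1 keeps one entry per element of descs, so positions still line up with the Series.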

Another method:

This method takes advantage of the Categorical data type. It may be faster than the merge, so time both on your real data.

>>> ll = pd.Series(list_of_list).explode()
>>> descs.astype("category").map(pd.Series(ll.index, index=ll.astype("category")))
1959    0.0
1960    1.0
1961    2.0
1962    1.0
1963    3.0
1964    NaN
1965    NaN
1966    4.0
1967    1.0
1968    2.0
1969    1.0
dtype: float64
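
A third option, if you prefer to stay close to plain Python: build a dictionary from description to group position once, then map the Series through it. This is only a sketch under the assumption that each description should resolve to the first sublist that contains it. The name lookup and the sample rows are hypothetical, and I have not benchmarked it against the two methods above.

```python
import pandas as pd

# Shortened, partly hypothetical sample data.
list_of_list = [["rfnd sms chrgs"],
                ["loan payment receipt"],
                ["zen june2018 aksg sal 1231552", "zen july2018 aksg sal 1411191"]]
descs = pd.Series(["rfnd sms chrgs", "loan payment receipt",
                   "no such description", "zen july2018 aksg sal 1411191"],
                  index=[1959, 1960, 1961, 1962])

# Build the description -> position table in a single pass over the sublists.
lookup = {}
for pos, sublist in enumerate(list_of_list):
    for text in sublist:
        # setdefault keeps the first position if a description repeats.
        lookup.setdefault(text, pos)

# Series.map with a dict yields NaN for descriptions that are not in the dict.
groups = descs.map(lookup)
```

Each lookup is a hash access instead of a scan over every sublist, which is where the original nested loop spends its time.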

Upvotes: 2
