Reputation: 621

Iteration over a list in a Pandas DataFrame column

I have a DataFrame df like this one:

                                                  my_list
Index                                                                
0                                               [81310, 81800]
1                                                      [82160]
2            [75001, 75002, 75003, 75004, 75005, 75006, 750...
3                                                      [95190]
4                                               [38170, 38180]
5                                                      [95240]
6                                                      [71150]
7                                                      [62520]

I have a list named code with at least one element.

code = ['75008', '75015']

I want to create another column in my DataFrame named my_min, containing, for each row, the minimum absolute difference between any element of the list code and any element of that row's list in df.my_list.

Here are the commands I tried:

df.loc[:, 'my_list'] = min([abs(int(x)-int(y)) for x in code for y in df.loc[:, 'my_list'].str[:]])
>>> TypeError: int() argument must be a string, a bytes-like object or a number, not 'list'

#or

df.loc[:, 'my_list'] = min([abs(int(x)-int(y)) for x in code for y in df.loc[:, 'my_list']])
>>> TypeError: int() argument must be a string, a bytes-like object or a number, not 'list'

#or

df.loc[:, 'my_list'] = min([abs(int(x)-int(y)) for x in code for y in df.loc[:, 'my_list'].tolist()])
>>> TypeError: int() argument must be a string, a bytes-like object or a number, not 'list'

#or

df.loc[:, 'my_list'] = min([abs(int(x)-int(y)) for x in code for y in z for z in df.loc[:, 'my_list'].str[:]])
>>> UnboundLocalError: local variable 'z' referenced before assignment

#or

df.loc[:, 'my_list'] = min([abs(int(x)-int(y)) for x in code for y in z for z in df.loc[:, 'my_list']])
>>> UnboundLocalError: local variable 'z' referenced before assignment

#or

df.loc[:, 'my_list'] = min([abs(int(x)-int(y)) for x in code for y in z for z in df.loc[:, 'my_list'].tolist()])
>>> UnboundLocalError: local variable 'z' referenced before assignment

Upvotes: 1

Answers (3)

Quang Hoang

Reputation: 150735

If you have pandas 0.25+ you can use explode and combine with np.min:

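import numpy as np
import pandas as pd

# sample data
df = pd.DataFrame({'my_list':
                  [[81310, 81800], [82160], [75001, 75002]]})
code = ['75008', '75015']

# flatten the lists into one Series; the index repeats once per element
s = df.my_list.explode().astype('int64')

# convert `code` into an integer np.array
code = np.array(code, dtype=int)

# broadcast to a (len(s), len(code)) matrix of absolute differences, take the
# row-wise minimum, then the minimum per original row label.
# (`.min(level=0)` was removed in pandas 2.0; `groupby(level=0).min()` is equivalent)
pd.Series(np.min(np.abs(s.to_numpy()[:, None] - code), axis=1),
          index=s.index).groupby(level=0).min()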

Output:

0    6295
1    7145
2       6
dtype: int64
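
To see what the broadcasting step is doing, here is a minimal NumPy-only sketch on the first two sample rows. The names values, row_labels, codes, per_value and per_row are made up for illustration and stand in for the exploded Series and its index; this is not how pandas does the grouping internally.

```python
import numpy as np

# Flattened values for the first two sample rows, plus the row label of each value.
values = np.array([81310, 81800, 82160])
row_labels = np.array([0, 0, 1])
codes = np.array([75008, 75015])

# values[:, None] has shape (3, 1); subtracting codes (shape (2,)) broadcasts to (3, 2).
diffs = np.abs(values[:, None] - codes)

# Distance from each flattened value to its closest code.
per_value = diffs.min(axis=1)

# Reduce to one minimum per original row label.
per_row = {int(label): int(per_value[row_labels == label].min())
           for label in np.unique(row_labels)}
print(per_row)
```

The dictionary comprehension plays the role of the per-label minimum in the pandas version.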

Upvotes: 0

J_H

Reputation: 20450

Write a helper: def find_min(lst): -- it is clear you know how to do that. The helper will consult a global named code.

Then apply it:

df['my_min'] = df.my_list.apply(find_min)

The advantage of breaking out a helper is you can write separate unit tests for it.

If you prefer to avoid globals, you will find partial quite helpful. https://docs.python.org/3/library/functools.html#functools.partial
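
A minimal sketch of what that could look like, assuming the helper takes the code list as a second parameter named codes (my choice of signature, not something from the question):

```python
from functools import partial

import pandas as pd


def find_min(lst, codes):
    """Return the smallest |c - v| over every c in codes and v in lst."""
    return min(abs(int(c) - int(v)) for c in codes for v in lst)


df = pd.DataFrame({'my_list': [[81310, 81800], [82160], [75001, 75002]]})
code = ['75008', '75015']

# Bind `codes` once so that apply() only has to pass each row's list.
df['my_min'] = df.my_list.apply(partial(find_min, codes=code))
print(df)
```

Because find_min takes everything it needs as arguments, it can be unit-tested without a DataFrame, for example find_min([75001], ['75008']) should give 7.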

Upvotes: 1

C8H10N4O2

Reputation: 19005

You could do this with a list comprehension:

import pandas as pd
import numpy as np
df = pd.DataFrame({'my_list':[[81310, 81800],[82160]]})

code = ['75008', '75015']

pd.DataFrame({'my_min':[min([abs(int(i) - j) for i in code for j in x]) 
              for x in df.my_list]})

This returns:

   my_min
0    6295
1    7145

You could also use pd.Series.apply instead of the outer list, for example:

df.my_list.apply(lambda x: min(abs(int(i) - j) for i in code for j in x))

Upvotes: 1
