pythonabsir
pythonabsir

Reputation: 49

pandas better runtime, going trough dataframe

I have a pandas dataframe, there I wanna search in one column for numbers, find it and put it in a new column.

import pandas
import regex as re
import numpy as np
data = {'numbers':['134.ABBC,189.DREB, 134.TEB', '256.EHBE, 134.RHECB, 345.DREBE', '456.RHN,256.REBN,864.TREBNSE', '256.DREB, 134.ETNHR,245.DEBHTECM'],
        'rate':[434, 456, 454256, 2334544]}
  

df = pd.DataFrame(data)
  

print(df)

pattern = '134.[A-Z]{2,}'


df['mynumbers'] = None


index_numbers = df.columns.get_loc('numbers')
index_mynumbers = df.columns.get_loc('mynumbers')


length = np.array([])


for row in range(0, len(df)):
    number = re.findall(pattern, df.iat[row, index_numbers])

    
    
    df.iat[row, index_mynumbers] = number


print(df)


I get my numbers: {'mynumbers': ['[134.ABBC, 134.TEB]', '[134.RHECB]', '[134.RHECB]']}. My dataframe is huge. Is there a better, faster method in pandas for going trough my df?

Upvotes: 0

Views: 166

Answers (1)

jezrael
jezrael

Reputation: 862841

Sure, use Series.str.findall instead loops:

pattern = '134.[A-Z]{2,}'
df['mynumbers'] = df['numbers'].str.findall(pattern)
print(df)
                            numbers     rate            mynumbers
0        134.ABBC,189.DREB, 134.TEB      434  [134.ABBC, 134.TEB]
1    256.EHBE, 134.RHECB, 345.DREBE      456          [134.RHECB]
2      456.RHN,256.REBN,864.TREBNSE   454256                   []
3  256.DREB, 134.ETNHR,245.DEBHTECM  2334544          [134.ETNHR]

If want using re.findall is it possible, only 2 times slowier:

pattern = '134.[A-Z]{2,}'
df['mynumbers'] = df['numbers'].map(lambda x: re.findall(pattern, x))

# [40000 rows]
df = pd.concat([df] * 10000, ignore_index=True)

pattern = '134.[A-Z]{2,}'

In [46]: %timeit df['numbers'].map(lambda x: re.findall(pattern, x))
50 ms ± 491 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [47]: %timeit df['numbers'].str.findall(pattern)
21.2 ms ± 340 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Upvotes: 4

Related Questions