Reputation: 83
Using str.extract() to extract numbers from strings works (see below), but it's time-consuming to repeat on 30+ columns.
Originally like this:
yyyy
2014 110.7438016528926\K
2015 103.7533512064343\O
2016 111.6531165311653\L
2017 nan\L
2018 nan\E
Name: dips_cert_earning_premium_weekly, dtype: object
Then I wrote this code:
df['dips_cert_earning_premium_weekly'] = df['dips_cert_earning_premium_weekly'].str.extract(r'(\d+(?:\.\d+)?)').astype(float)
Output:
yyyy
2014 110.743802
2015 103.753351
2016 111.653117
2017 NaN
2018 NaN
Name: dips_cert_earning_premium_weekly, dtype: float64
Is there a more efficient way to do this?
Upvotes: 0
Views: 247
Reputation: 391
I'm not sure what your whole dataset looks like or what you mean by slow: is it taking seconds? Minutes? Hours? However, if it's possible, you could try an approach that doesn't use a regexp. For instance:
letters = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ\\"
def remove_letters(x):
try:
return float(x.strip(letters))
except:
return None
df['dips_cert_earning_premium_weekly'] = list(map(remove_letters, df['dips_cert_earning_premium_weekly']))
If the dataset only has one column with 31 rows, your version takes 0.0008417534828186036 seconds while the version I propose takes 0.0003292644023895264 seconds, so it's a slight difference but worth trying. (I calculated these times by running each version N=1000 times and averaging.)
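For reference, here's a minimal sketch of how such a comparison could be timed (the 31-row sample column is made up to resemble the question's data, and remove_letters is the function defined above):

import timeit

import pandas as pd

# Hypothetical 31-row, single-column frame resembling the question's data.
rows = ["110.7438016528926\\K", "103.7533512064343\\O", "nan\\L"] * 10 + ["111.6531165311653\\L"]
sample = pd.DataFrame({"dips_cert_earning_premium_weekly": rows})

def with_regex():
    return sample["dips_cert_earning_premium_weekly"].str.extract(r"(\d+(?:\.\d+)?)").astype(float)

def with_strip():
    return list(map(remove_letters, sample["dips_cert_earning_premium_weekly"]))

N = 1000
print("regex:", timeit.timeit(with_regex, number=N) / N)  # average seconds per run
print("strip:", timeit.timeit(with_strip, number=N) / N)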
Otherwise, try to profile your code to see where the bottleneck is and try different implementations until you find the fastest one.
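For instance, a quick cProfile sketch (reusing remove_letters and df from above; Python 3.8+):

import cProfile
import pstats

# Profile one full conversion pass and sort by cumulative time
# to see whether the string parsing or the pandas assignment dominates.
with cProfile.Profile() as prof:
    for column_name in df.columns:
        df[column_name] = list(map(remove_letters, df[column_name]))
pstats.Stats(prof).sort_stats("cumulative").print_stats(10)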
Edit:
To fully answer the question with the new data: to apply this to each column, you can iterate over the DataFrame with the following code:
import pandas as pd

df = pd.DataFrame.from_dict(data)  # data: the dict of columns from the question
for column_name, column_data in df.items():  # iteritems() was removed in pandas 2.0; items() is equivalent
    # Same line as before; the only difference is that it now runs on every column.
    df[column_name] = list(map(remove_letters, column_data))
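If you prefer to stay inside pandas, an equivalent sketch (assuming every column of df holds these "number\letter" strings) does the same conversion with DataFrame.apply and Series.map:

# Same conversion as the loop above, expressed as a single apply/map call.
df = df.apply(lambda col: col.map(remove_letters))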
Upvotes: 2