Ilovenoodles
Ilovenoodles

Reputation: 83

How to extract numbers from strings on ALL the columns in python

Using str.extract() to extract numbers from strings works (see below), but it's time-consuming when I have to repeat this on 30+ columns.

Originally like this:

yyyy
2014    110.7438016528926\K
2015    103.7533512064343\O
2016    111.6531165311653\L
2017                  nan\L
2018                  nan\E
Name: dips_cert_earning_premium_weekly, dtype: object

Then I wrote these codes:

df['dips_cert_earning_premium_weekly'] = df['dips_cert_earning_premium_weekly'].str.extract(r'(\d+(?:\.\d+)?)').astype(float)

Output:

yyyy
2014    110.743802
2015    103.753351
2016    111.653117
2017           NaN
2018           NaN
Name: dips_cert_earning_premium_weekly, dtype: float64

Is there a more efficient way to do this?

Upvotes: 0

Views: 247

Answers (1)

Roxy
Roxy

Reputation: 391

I'm not sure how your whole dataset looks like or what you mean by slow: is it taking minutes? seconds? hours?. However, if it's possible you could try an approach not using regexp. For instance:

letters = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ\\"
def remove_letters(x):
    try:
        return float(x.strip(letters))
    except:
        return None

df['dips_cert_earning_premium_weekly'] = list(map(remove_letters,  df['dips_cert_earning_premium_weekly']))

If the dataset only has one column with 31 rows, your version takes 0.0008417534828186036 seconds while the version I propose takes 0.0003292644023895264 seconds so it's a slight difference but it's worth trying. (I calculated this times by taking the times for N=1000 runs and then I took the average of all times)

Otherwise, try to profile your code to see where the bottleneck is and try different implementations until you find the fastest one.

Edit:

To fully answer the question with the new data: To iterate over the dataframe and apply this to each column, you can try the following code:

df = pd.DataFrame.from_dict(data)
for (columnName, columnData) in df.iteritems():
   df[columnName] = list(map(remove_letters,  df[columnName])) # This is the same line from before. Only difference is that we are applying it to all columns.

Upvotes: 2

Related Questions