ah bon
ah bon

Reputation: 10011

Split string column and extract second part in Python

Let's say I have a dataframe as follows:

df = pd.DataFrame({"id": range(4), "price": ["15dollar/m2/day", "90dollar/m2/month", "18dollar/m2/day", "100dollar/m2/month"]})

       id               price
    0   0     15dollar/m2/day
    1   1   90dollar/m2/month
    2   2     18dollar/m2/day
    3   3  100dollar/m2/month

I would like to split column price into two new columns: unit_price and price_unit as below:

   id     unit_price  price_unit
0   0        15.0    dollar/m2/day
1   1        90.0    dollar/m2/month
2   2        18.0    dollar/m2/day
3   3       100.0    dollar/m2/month

This is my solution:

df['unit_price'] = df['price'].str.split('dollar').str[0].astype(float)
#df['unit_price'] = df['price'].str.extract('(\d*\.\d+|\d+)', expand=False).astype(float)
df['price_unit'] = df['price'].str.split('dollar').str[1]
del df['price']

For the column unit_price, it works fine, but for price_unit, when i split by dollar, I got result as below which doesn't including character dollar, or if I use df['price'].str.replace(r'\d', ''), all numbers were removed. How could I do it correctly in Python? Thanks.

df['price_unit']
Out[474]: 
0      /m2/day
1    /m2/month
2      /m2/day
3    /m2/month
Name: price_unit, dtype: object 

Upvotes: 1

Views: 1526

Answers (2)

Aravind Voggu
Aravind Voggu

Reputation: 1531

If it's just a string, why not keep it simple and add the dollar part yourself?

df['price_unit'] = 'dollar' + df['price'].str.split('dollar').str[1]

import pandas as pd

df = pd.DataFrame({"id": range(4), "price": ["15dollar/m2/day", "90dollar/m2/month", "18dollar/m2/day", "100dollar/m2/month"]})

df['unit_price'] = df['price'].str.split('dollar').str[0].astype(float)
df['price_unit'] = 'dollar' + df['price'].str.split('dollar').str[1]

del df['price']

print(df)

   id  unit_price       price_unit
0   0        15.0    dollar/m2/day
1   1        90.0  dollar/m2/month
2   2        18.0    dollar/m2/day
3   3       100.0  dollar/m2/month

Upvotes: 0

jezrael
jezrael

Reputation: 862601

You can use Series.str.extract with regex - ^ for start of string, \d*\.\d+ for floats or \d+ for integers and then for all another values by .*:

df = df.join(df.pop('price').str.extract('(?P<unit_price>^\d*\.\d+|^\d+)(?P<price_unit>.*)'))
print (df)
   id unit_price       price_unit
0   0         15    dollar/m2/day
1   1         90  dollar/m2/month
2   2         18    dollar/m2/day
3   3        100  dollar/m2/month

First solution is using extract and replace by numeric:

pat = '(^\d*\.\d+|^\d+)'
df['unit_price'] = df['price'].str.extract(pat, expand=False)
df['price_unit'] = df.pop('price').str.replace(pat,'')
print (df)
   id unit_price       price_unit
0   0         15    dollar/m2/day
1   1         90  dollar/m2/month
2   2         18    dollar/m2/day
3   3        100  dollar/m2/month

Upvotes: 3

Related Questions