Reputation: 115
I would like to extract a specific substring from a string value in a column. In the code below you'll see that I've managed to extract the year, make and model.
import pandas as pd
import numpy as np
data = [['year_make_model'],
['1970 Plymouth Hemi Cuda'],
['1970 Dodge Challenger R/T SE Convertible'],
['1971 Plymouth Cuda']]
data = pd.DataFrame(data[1:], columns=data[0])
def create_year_make_model(df):
df['year'] = df.year_make_model.str.extract('(\d\d\d\d)', expand=True)
df['make'] = np.where(df.year_make_model.str.contains('Plymouth'), 'Plymouth',
np.where(df.year_make_model.str.contains('Dodge'), 'Dodge', 'Nan'))
df['model'] = np.where(df.year_make_model.str.contains('Cuda'), 'Cuda',
np.where(df.year_make_model.str.contains('Challenger'), 'Challenger', 'Nan'))
return df.head()
create_year_make_model(data)
year_make_model year make model
0 1970 Plymouth Hemi Cuda 1970 Plymouth Cuda
1 1970 Dodge Challenger R/T SE Convertible 1970 Dodge Challenger
2 1971 Plymouth Cuda 1971 Plymouth Cuda
Now I need to extract all the remaining words (if there are any). The remaining words make up the vehicle 'type' (see example below).
car_type = ['Hemi', 'R/T SE Convertible', None]
data['type'] = car_type
data
year_make_model year make model type
0 1970 Plymouth Hemi Cuda 1970 Plymouth Cuda Hemi
1 1970 Dodge Challenger R/T SE Convertible 1970 Dodge Challenger R/T SE Convertible
2 1971 Plymouth Cuda 1971 Plymouth Cuda None
I've tried many different approaches so far without luck. One approach that I've tried is creating a list for all the year, make and model combinations word_list = ['Cuda','Challenger','1970','1971','Dodge','Plymouth']
. I then made a for loop trying to extract the words in column 'year_make_model' which are not in the word_list but I couldn't get it to work. I hope someone knows a need trick to solve my problem, thanks in advance!
Upvotes: 0
Views: 120
Reputation: 127
I'm not sure whether this is the best way to solve it, but it does seem to work:
import numpy as np
import pandas as pd
data = [['year_make_model'],
['1970 Plymouth Hemi Cuda'],
['1970 Dodge Challenger R/T SE Convertible'],
['1971 Plymouth Cuda']]
data = pd.DataFrame(data[1:], columns=data[0])
def create_year_make_model(df):
cell_text = df.year_make_model.str
df['year'] = cell_text.extract('(\d\d\d\d)', expand=True)
df['make'] = np.where(cell_text.contains('Plymouth'), 'Plymouth',
np.where(cell_text.contains('Dodge'), 'Dodge', 'Nan'))
df['model'] = np.where(cell_text.contains('Cuda'), 'Cuda',
np.where(cell_text.contains('Challenger'), 'Challenger', 'Nan'))
# what fields to remove
rm_fields = ["year", "make", "model"]
for field in rm_fields:
# remove this field
for val in getattr(df, field).values:
cell_text = cell_text.replace(str(val), "").str
# clean any leading/trailing spaces
cell_text = cell_text.replace(r"^\s+", "")
cell_text = cell_text.replace(r"\s+$", "")
df['type'] = cell_text
return df.head()
print(create_year_make_model(data))
So after you have extracted the year, make and model fields you can go back and get the extracted values for those fields and do replace them with spaces.
Also any possible leading/trailing whitespace should be removed for consistency.
Upvotes: 1