Reputation: 47
I have a CSV file with a column that contains a string with vehicle options.
Brand Options Toyota Color:Black,Wheels:18 Toyota Color:Black, Chevy Color:Red,Wheels:16,Style:18 Honda Color:Green,Personalization:"Customer requested detailing" Chevy Color:Black,Wheels:16
I want to expand the "Options" string to new columns with the appropriate name. The dataset is considerably large so I am trying to name the columns programmatically (ie: Color, Wheels, Personalization) then apply the respective value to the row or a null value.
Adding new data
import pandas as pd
Cars = pd.read_csv("Cars.CSV") # Loads cars into df
split = Cars["Options"].str.split(",", expand = True) # Data in form of {"Color:Black", "Wheels:16"}
split[0][0].split(":") # returns ['Color', 'Black']
What is an elegant way to concat these lists to the original dataframe Cars without specified columns?
Upvotes: 2
Views: 38
Reputation: 16683
You can prepare for a clean split
by first using rstrip
to avoid a null column, since you have one row with a comma at the end. Then, after splitting, explode to multiple rows and split again by :
, this time using expand=True
. Then, pivot
the dataset into the desired format and concat
back to the original dataframe:
pd.concat([df,
df['Options'].str.rstrip(',')
.str.split(',')
.explode()
.str.split(':', expand=True)
.pivot(values=1, columns=0)],
axis=1).drop('Options', axis=1)
Out[1]:
Brand Color Personalization Style Wheels
0 Toyota Black NaN NaN 18
1 Toyota Black NaN NaN NaN
2 Chevy Red NaN 18 16
3 Honda Green "Customer requested detailing" NaN NaN
4 Chevy Black NaN NaN 16
Upvotes: 2