Reputation: 355
I have a column like this -
COL A
product a [Class A], product a [Class C]
product a [Class H], product b [Class A], product b [Class B]
product c [Class C], product d [Class A]
Want to convert to this using pandas -
Class A Class B Class C Class H
product a product a NaN NaN
product b product b NaN product a
product d NaN product c NaN
Upvotes: 0
Views: 93
Reputation: 989
Input:
temp = pd.DataFrame({'COL A':['product a [Class A], product a [Class C]',
'product a [Class H], product b [Class A], product b [Class B]',
'product c [Class C], product d [Class A]']})
Use string split opertations, first split on ',' to get the list of pair on each row and then split on '[' to get the column name & value and store them in a dictionary
d = {}
cnt = 0
for s in temp['COL A']:
d[cnt] = {i.split('[')[1][:-1]:i.split('[')[0].strip() for i in s.split(',')}
cnt += 1
The dictionary d:
{0: {'Class A': 'product a', 'Class C': 'product a'},
1: {'Class H': 'product a', 'Class A': 'product b', 'Class B': 'product b'},
2: {'Class C': 'product c', 'Class A': 'product d'}}
Then create a new dataframe from the dictionary:(used transpose to get class A and class B as columns and then used sort_index to order the column names)
df = pd.DataFrame(d).transpose().sort_index(axis=1)
Output:
Class A Class B Class C Class H
0 product a NaN product a NaN
1 product b product b NaN product a
2 product d NaN product c NaN
Upvotes: 1
Reputation: 5802
Here is another approach using str.split
and explode
to bring the data into a more canonical form, then extract
the product name and class
with a regular expression (tinker with it as needed):
([\w\s]+)
: sequence of letters and whitespace\[(.+)\]
: something in between []
The final step is a basic pivot
to get the desired format.
df2 = (df["COL A"].str.split(",") # split into lists
.explode() # explode lists into rows
.str.extract("([\w\s]+)\[(.+)\]") # extract info
.pivot(values=0, columns=1)) # pivot
Result:
1 Class A Class B Class C Class H
0 product a NaN product a NaN
1 product b product b NaN product a
2 product d NaN product c NaN
Upvotes: 1