Santoo
Santoo

Reputation: 355

Pandas column to multiple Column using parenthesis

I have a column like this -

COL A
product a [Class A], product a [Class C]
product a [Class H], product b [Class A], product b [Class B]
product c [Class C], product d [Class A]

Want to convert to this using pandas -

Class A     Class B     Class C    Class H
product a   product a   NaN        NaN
product b   product b   NaN        product a
product d   NaN         product c  NaN

Upvotes: 0

Views: 93

Answers (2)

Asetti sri harsha
Asetti sri harsha

Reputation: 989

Input:

 temp = pd.DataFrame({'COL A':['product a [Class A], product a [Class C]',
                              'product a [Class H], product b [Class A], product b [Class B]',
                              'product c [Class C], product d [Class A]']})

Use string split opertations, first split on ',' to get the list of pair on each row and then split on '[' to get the column name & value and store them in a dictionary

d = {}

cnt = 0
for s in temp['COL A']:
    d[cnt] = {i.split('[')[1][:-1]:i.split('[')[0].strip() for i in s.split(',')}
    cnt += 1

The dictionary d:

{0: {'Class A': 'product a', 'Class C': 'product a'},
 1: {'Class H': 'product a', 'Class A': 'product b', 'Class B': 'product b'},
 2: {'Class C': 'product c', 'Class A': 'product d'}}

Then create a new dataframe from the dictionary:(used transpose to get class A and class B as columns and then used sort_index to order the column names)

df = pd.DataFrame(d).transpose().sort_index(axis=1)

Output:

      Class A    Class B    Class C    Class H
0  product a        NaN  product a        NaN
1  product b  product b        NaN  product a
2  product d        NaN  product c        NaN

Upvotes: 1

fsimonjetz
fsimonjetz

Reputation: 5802

Here is another approach using str.split and explode to bring the data into a more canonical form, then extract the product name and class with a regular expression (tinker with it as needed):

  • ([\w\s]+): sequence of letters and whitespace
  • \[(.+)\]: something in between []

The final step is a basic pivot to get the desired format.

df2 = (df["COL A"].str.split(",")  # split into lists
       .explode()  # explode lists into rows
       .str.extract("([\w\s]+)\[(.+)\]")  # extract info
       .pivot(values=0, columns=1))  # pivot

Result:

1      Class A      Class B      Class C     Class H
0   product a           NaN   product a          NaN
1   product b    product b           NaN  product a 
2   product d           NaN   product c          NaN

Upvotes: 1

Related Questions