LizzAlice
LizzAlice

Reputation: 720

Creating new columns from pandas df column names and values

I have a dataframe of the form:

basket            fruit_apple   fruit_pear   fruit_cherry
basket_id_1       5             NaN          6
basket_id_2       NaN           1            NaN

and I want to create two new columns that should look like this:

basket          fruit_type   number
basket_id_1     apple        5
basket_id_1     pear         NaN
basket_id_1     cherry       6
basket_id_2     apple        NaN
basket_id_2     pear         1
basket_id_2     cherry       NaN

where the contents of fruit_type are determined automatically with a colname.split('_')[1] How can this be done in a concise way?

Upvotes: 2

Views: 108

Answers (2)

sammywemmy
sammywemmy

Reputation: 28729

Transform from wide to long format,
create new columns by stripping 'fruit_' from fruit_type column
and sorting by 'basket' column

Edit: thanks to @jezrael, who pointed out the dangers of stripping (it will take out any of the alphabets, which will definitely affect scalability to other use cases), changed it to replace, which should be safer, since it works on regex and grouping.

(df.melt(id_vars='basket',
         var_name='fruit_type',
         value_name='number')
   .assign(fruit_type=lambda x: x.fruit_type.str.replace(r'fruit_',''))
   .sort_values('basket')
  )

    basket  fruit_type  number
0   basket_id_1 apple   5.0
2   basket_id_1 pear    NaN
4   basket_id_1 cherry  6.0
1   basket_id_2 apple   NaN
3   basket_id_2 pear    1.0
5   basket_id_2 cherry  NaN

Upvotes: 2

jezrael
jezrael

Reputation: 863741

I think faster is change column names insted new column (because less number of values), so I suggest use rename in first step and reshape with DataFrame.melt, last change order by DataFrame.sort_values:

df = (df.rename(columns= lambda x: x.split('_')[-1])
        .melt('basket', var_name='fruit_type', value_name='number')
        .sort_values('basket'))
print (df)
        basket fruit_type  number
0  basket_id_1      apple     5.0
2  basket_id_1       pear     NaN
4  basket_id_1     cherry     6.0
1  basket_id_2      apple     NaN
3  basket_id_2       pear     1.0
5  basket_id_2     cherry     NaN

Another idea is use DataFrame.set_index with DataFrame.stack:

df = (df.set_index('basket')
        .rename(columns= lambda x: x.split('_')[-1])
        .rename_axis('fruit_type', axis=1)
        .stack(dropna=False)
        .reset_index(name='number')
        )
print (df)
        basket fruit_type  number
0  basket_id_1      apple     5.0
1  basket_id_1       pear     NaN
2  basket_id_1     cherry     6.0
3  basket_id_2      apple     NaN
4  basket_id_2       pear     1.0
5  basket_id_2     cherry     NaN

Upvotes: 7

Related Questions