chemnteach
chemnteach

Reputation: 395

Long to Wide DataFrame in Pandas With Pivot Column Name in New Columns

I've searched quite a bit and I can't seem to find something along the line of pivot functionality for my particular problem. I'll convey a simple example of what I'm looking for:

Long Table

dependent_variable  step a  b
         5.5          1  20 30
         5.5          2  25 37
         6.1          1  22 19
         6.1          2  18 29

Desired Wide Table

dependent_variable   a_step1 a_step2 b_step1  b_step2
         5.5            20       25      30       37
         6.1            22       18      19       29

Effectively I would like to pivot on the Step column, and to make the column name for the rest of the independent variables (in this case a and b) include the step number and the a/b value associated with it.

Once pivoted, then I will use the dependent variable column and as a numpy array and the newly pivoted dependent variables to feed into various machine learning algorithms.

When I attempted piRSquared's suggestion (thank you) I got the error: Index contains duplicate entries, cannot reshape.

I then tried (from Here)

d1 =data.set_index(['dependent_variable','step'], append=True).unstack()
d1.columns = d1.columns.map(lambda x: '{}_step{}'.format(*x))
d1.reset_index(inplace=True)

And (using the example table) got the following:

level_0   dependent_variable a_step1 a_step2 b_step1 b_step2
  1               5.5           20      NaN    30       NaN
  2               5.5           NaN     25     NaN      37
  3               6.1           22      NaN    19       NaN
  4               6.1           NaN     18     NaN      29

So, I'm still missing a step

Upvotes: 5

Views: 2381

Answers (3)

sammywemmy
sammywemmy

Reputation: 28644

Note that if your combination of index and columns is duplicated, the solution preferred below won't work, as internally it relies on pd.pivot, which fails on duplicate index/column combo.

One option is with pivot_wider from pyjanitor, using the names_glue parameter to reshape the column names:

# currently in dev
# pip install git+https://github.com/pyjanitor-devs/pyjanitor.git
import pandas as pd
import janitor

df.pivot_wider(
    index = 'dependent_variable', 
    names_from = 'step', 
    values_from = ['a', 'b'],
    names_glue = "{_value}_step{step}"
  )

   dependent_variable  a_step1  a_step2  b_step1  b_step2
0                 5.5       20       25       30       37
1                 6.1       22       18       19       29

in the names_glue string template, _value serves as a placeholder for values from values_from, which are a and b. Anything in the {} brackets should either be from names_from or _value to represent values_from.

Upvotes: 0

DVL
DVL

Reputation: 364

It looks like you are looking for pd.pivot

"If the values argument is omitted, and the input DataFrame has more than one column of values which are not used as column or index inputs to pivot, then the resulting “pivoted” DataFrame will have hierarchical columns whose topmost level indicates the respective value column- https://pandas.pydata.org/pandas-docs/stable/reshaping.html."

df = pd.DataFrame({'dependent_variable':[5.5,5.5,6.1,6.1],
          'step':[1,2,1,2],
          'a':[20,25,22,18],
          'b':[30,37,19,29],
         })
df = df.pivot(index='dependent_variable',
     columns='step')

yields

        a       b
step    1   2   1   2
dependent_variable              
5.5     20  25  30  37
6.1     22  18  19  29

it has a hierarchical index, which might be more helpful than the output you indicated. However, you can change to a single column index by

df.columns = df.columns.tolist()

the columns don't have the exact names you wanted, but you could then rename.

Upvotes: 2

piRSquared
piRSquared

Reputation: 294228

Assuming the name of your dataframe is df and dependent_variable, step are not already in the index

d1 = df.set_index(['dependent_variable', 'step']).unstack()
d1.columns = d1.columns.map(lambda x: '{}_step{}'.format(*x))
d1.reset_index(inplace=True)

print(d1)

   dependent_variable  a_step1  a_step2  b_step1  b_step2
0                 5.5       20       25       30       37
1                 6.1       22       18       19       29

Upvotes: 3

Related Questions