Reputation: 395
I've searched quite a bit and I can't seem to find something along the line of pivot functionality for my particular problem. I'll convey a simple example of what I'm looking for:
Long Table
dependent_variable step a b
5.5 1 20 30
5.5 2 25 37
6.1 1 22 19
6.1 2 18 29
Desired Wide Table
dependent_variable a_step1 a_step2 b_step1 b_step2
5.5 20 25 30 37
6.1 22 18 19 29
Effectively I would like to pivot on the Step column, and to make the column name for the rest of the independent variables (in this case a and b) include the step number and the a/b value associated with it.
Once pivoted, then I will use the dependent variable column and as a numpy array and the newly pivoted dependent variables to feed into various machine learning algorithms.
When I attempted piRSquared's suggestion (thank you) I got the error: Index contains duplicate entries, cannot reshape.
I then tried (from Here)
d1 =data.set_index(['dependent_variable','step'], append=True).unstack()
d1.columns = d1.columns.map(lambda x: '{}_step{}'.format(*x))
d1.reset_index(inplace=True)
And (using the example table) got the following:
level_0 dependent_variable a_step1 a_step2 b_step1 b_step2
1 5.5 20 NaN 30 NaN
2 5.5 NaN 25 NaN 37
3 6.1 22 NaN 19 NaN
4 6.1 NaN 18 NaN 29
So, I'm still missing a step
Upvotes: 5
Views: 2381
Reputation: 28644
Note that if your combination of index and columns is duplicated, the solution preferred below won't work, as internally it relies on pd.pivot
, which fails on duplicate index/column combo.
One option is with pivot_wider from pyjanitor, using the names_glue
parameter to reshape the column names:
# currently in dev
# pip install git+https://github.com/pyjanitor-devs/pyjanitor.git
import pandas as pd
import janitor
df.pivot_wider(
index = 'dependent_variable',
names_from = 'step',
values_from = ['a', 'b'],
names_glue = "{_value}_step{step}"
)
dependent_variable a_step1 a_step2 b_step1 b_step2
0 5.5 20 25 30 37
1 6.1 22 18 19 29
in the names_glue string template, _value
serves as a placeholder for values from values_from
, which are a
and b
. Anything in the {}
brackets should either be from names_from
or _value
to represent values_from
.
Upvotes: 0
Reputation: 364
It looks like you are looking for pd.pivot
"If the values argument is omitted, and the input DataFrame has more than one column of values which are not used as column or index inputs to pivot, then the resulting “pivoted” DataFrame will have hierarchical columns whose topmost level indicates the respective value column- https://pandas.pydata.org/pandas-docs/stable/reshaping.html."
df = pd.DataFrame({'dependent_variable':[5.5,5.5,6.1,6.1],
'step':[1,2,1,2],
'a':[20,25,22,18],
'b':[30,37,19,29],
})
df = df.pivot(index='dependent_variable',
columns='step')
yields
a b
step 1 2 1 2
dependent_variable
5.5 20 25 30 37
6.1 22 18 19 29
it has a hierarchical index, which might be more helpful than the output you indicated. However, you can change to a single column index by
df.columns = df.columns.tolist()
the columns don't have the exact names you wanted, but you could then rename.
Upvotes: 2
Reputation: 294228
Assuming the name of your dataframe is df
and dependent_variable
, step
are not already in the index
d1 = df.set_index(['dependent_variable', 'step']).unstack()
d1.columns = d1.columns.map(lambda x: '{}_step{}'.format(*x))
d1.reset_index(inplace=True)
print(d1)
dependent_variable a_step1 a_step2 b_step1 b_step2
0 5.5 20 25 30 37
1 6.1 22 18 19 29
Upvotes: 3