Reputation: 41
I am trying to reshape the following data from wide to long format
df = pd.DataFrame(
{
"size_Ent": {
pd.Timestamp("2021-01-01 00:00:00"): 600,
pd.Timestamp("2021-01-02 00:00:00"): 930,
},
"size_Baci": {
pd.Timestamp("2021-01-01 00:00:00"): 700,
pd.Timestamp("2021-01-02 00:00:00"): 460,
},
"min_area_Ent": {
pd.Timestamp("2021-01-01 00:00:00"): 1240,
pd.Timestamp("2021-01-02 00:00:00"): 1503,
},
"min_area_Baci": {
pd.Timestamp("2021-01-01 00:00:00"): 1285,
pd.Timestamp("2021-01-02 00:00:00"): 953,
},
}
)
size_Ent size_Baci min_area_Ent min_area_Baci
2021-01-01 600 700 1240 1285
2021-01-02 930 460 1503 953
The problem is that the column names contain two different pieces of information separated by an underscore:
Additionally, I'd like the row indexes to remain as timestamps.
It should look like this:
bacterium min_area size
2021-01-01 Baci 1285 700
2021-01-01 Ent 1240 600
2021-01-02 Baci 953 460
2021-01-02 Ent 1503 930
I tried transposing the data frame with df.T
but this did not give the result I want.
Upvotes: 3
Views: 723
Reputation: 28709
One option is with pd.wide_to_long
:
(pd
.wide_to_long(
df.reset_index(),
stubnames=['min_area', 'size'],
i = 'index',
j = 'bacterium',
sep='_',
suffix='.+')
.reset_index('bacterium')
)
bacterium min_area size
index
2021-01-01 Ent 1240 600
2021-01-02 Ent 1503 930
2021-01-01 Baci 1285 700
2021-01-02 Baci 953 460
Another option with pivot_longer from pyjanitor - for this case we use a regex in names_pattern
:
# pip install pyjanitor
import janitor
(df
.pivot_longer(
index=None,
names_to = ('.value', 'bacterium'),
names_pattern=r'(size|min_area)_(.+)',
ignore_index=False)
)
bacterium size min_area
2021-01-01 Ent 600 1240
2021-01-02 Ent 930 1503
2021-01-01 Baci 700 1285
2021-01-02 Baci 460 953
Another option is with stack
:
temp = df.copy()
temp.columns = temp.columns.str.rsplit("_", n = 1, expand=True)
temp.columns.names = [None, 'bacterium']
temp.stack('bacterium').reset_index('bacterium')
bacterium min_area size
2021-01-01 Baci 1285 700
2021-01-01 Ent 1240 600
2021-01-02 Baci 953 460
2021-01-02 Ent 1503 930
Upvotes: 2
Reputation: 402
This can be solved in three simple steps:
First, notice that your column names are actually encoding a 2x2 MultiIndex, so let's start by creating a MultiIndex from tuples. To do this, we need to first transform the existing column names into tuples. This is easy because we know they should be split at the last underscore.
# Convert column names into MultiIndex, giving an informative name to the level with label data
column_tuples = df.columns.str.rsplit("_", n=1)
column_tuples = [tuple(c) for c in column_tuples]
df.columns = pd.MultiIndex.from_tuples(column_tuples,names=[None,'bacterium'])
Next, use df.stack()
to take the 'bacterium' level from the column MultiIndex and move it into a row MultiIndex. This is not quite the same as the transpose operation that you tried.
df = df.stack('bacterium')
Finally, use df.reset_index()
with the level
argument to take the bacterium level from the row MultiIndex and make it a proper column.
df = df.reset_index('bacterium')
Result:
bacterium min_area size
2021-01-01 Baci 1285 700
2021-01-01 Ent 1240 600
2021-01-02 Baci 953 460
2021-01-02 Ent 1503 930
Upvotes: 3