Reputation: 124
I have a datafame in pandas with has a group of columns with hyphens (theres several but I'll use 2 as an example, _1 and _2), which both depict a different year.
df = pd.DataFrame({'A': ['BP','Virgin'],
'B(LY)': ['A','C'],
'B(LY_1)': ['B', 'D'],
'C': [1, 3],
'C_1': [2,4],
'D': ['W','Y'],
'D_1': ['X','Z']})
I'm trying to reorganise the table to pivot it, so that it look like this:
df = pd.DataFrame({'A': ['BP','BP', 'Virgin', 'Virgin'],
'Year': ['A','B','C','D'],
'C': [1,2,3,4],
'D': ['W','X','Y','Z']})
But I can't work out how to do it. The problem is, I only need the the hyphen column to match the equivalent hyphen for the other variables. Any help is appreciated, thanks
EDIT
here is a real life example of the data
df = pd.DataFrame({'Company': ['BP','Virgin'],
'Account_date(LY)': ['May','Apr'],
'Account_date(LY_1)': ['Apr', 'Mar'],
'Account_date(LY_2)': ['Mar', 'Feb'],
'Account_date(LY_3)': ['Feb', 'Jan'],
'Acc_day': [1, 5],
'Acc_day_1': [2,6],
'Acc_day_2': [3,7],
'Acc_day_2': [4,8],
'D': ['W','A'],
'D_1': ['X','B'],
'D_1': ['Y','C'],
'D_1': ['Z','D']})
desired output:
df = pd.DataFrame({'Company': ['BP','BP','BP','BP', 'Virgin', 'Virgin','Virgin', 'Virgin'],
'Year': ['May','Apr','Mar','Feb','Apr','Mar','Feb','May'],
'Acc_day': [1,2,3,4,5,6,7,8],
'D': ['W','X','Y','Z','A','B','C','D']})
Upvotes: 0
Views: 80
Reputation: 262164
You can use:
# set A aside
df2 = df.set_index('A')
# split columns to MultiIndex on "_"
df2.columns = df2.columns.str.split('_', expand=True)
# reshape
out = df2.stack().droplevel(1).rename(columns={'B': 'Year'}).reset_index()
Or using janitor's pivot_longer
:
import janitor
out = (df.pivot_longer(index='A', names_sep='_', names_to=('.value', '_drop'), sort_by_appearance=True)
.rename(columns={'B': 'Year'}).drop(columns='_drop')
)
Output:
A Year C D
0 BP A 1 W
1 BP B 2 X
2 Virgin C 3 Y
3 Virgin D 4 Z
using a mapper to match (LY)
-> _1
, etc.
import re
# you can generate this mapper programmatically if needed
mapper = {'(LY)': '_1', '(LY-1)': '_2'}
# set A aside
df2 = df.set_index('A')
# split columns to MultiIndex on "_"
pattern = '|'.join(map(re.escape, mapper))
df2.columns = df2.columns.str.replace(pattern, lambda m: mapper[m.group()], regex=True).str.split('_', expand=True)
# reshape
out = df2.stack().droplevel(1).rename(columns={'B': 'Year'}).reset_index()
Output:
A Year C D
0 BP A 1 W
1 BP B 2 X
2 Virgin C 3 Y
3 Virgin D 4 Z
Upvotes: 2