Reputation: 549
I have a dataframe that looks like this:
df = pd.DataFrame({"Object": ['Apple', 'Orange', 'Banana', 'Grape', 'Cherry'],
"Jan 01 Vol": [0, 5, 2, 4, 8],
"Jan 01 Price": [1.15, 2.30, 1.75, 3.4, 2.5],
"Jan 01 Sales": [0, 11.5, 5.25, 13.6, 20],
"Jan 02 Vol": [1, 2, 3, 4, 5],
"Jan 02 Price": [1.15, 2.30, 1.75, 3.4, 2.5],
"Jan 02 Sales": [1.15, 4.6, 5.25, 13.6, 12.5],
"Feb 01 Vol": [5, 4, 3, 2, 1],
"Feb 01 Price": [1.15, 2.30, 1.75, 3.4, 2.5],
"Feb 01 Sales": [5.75, 9.2, 5.25, 6.8, 2.5],})
I want to be able to manipulate the dataframe such that "Vol", "Price", "Sales" are their own columns while pivoting the date aspect of the column vertically so that it looks like this:
df2 = pd.DataFrame({"Object": ['Apple', 'Apple', 'Apple',
'Orange','Orange', 'Orange',
'Banana', 'Banana', 'Banana',
'Grape', 'Grape', 'Grape',
'Cherry', 'Cherry', 'Cherry'],
"Year": [2001, 2001, 2002,
2001, 2001, 2002,
2001, 2001, 2002,
2001, 2001, 2002,
2001, 2001, 2002],
"Month": [1, 2, 1,
1, 2, 1,
1, 2, 1,
1, 2, 1,
1, 2, 1],
"Vol": [0, 5, 1, 5, 4, 2, 2, 3, 3, 4, 2, 4, 8, 1, 5],
"Price": [1.15, 1.15, 1.15, 2.30, 2.30, 2.30, 1.75, 1.75, 1.75, 3.4, 3.4, 3.4, 2.5, 2.5, 2.5],
"Sales": [0, 5.75, 1.15, 11.50, 9.2, 4.6, 5.25, 5.25, 5.25, 13.60, 6.8, 13.60, 20, 2.5, 12.5]})
I thought about doing a lambda function that creates a new column that pulls in the year from the horizontal column names, but that won't work because the array lengths are different. I also thought about doing a pivot table, but again, wasn't sure how to parse out the "Vol", "Price" "Sales" aspects of the columns into their own column. Any help would be appreciated.
Upvotes: 2
Views: 694
Reputation: 3353
dfm = df.melt(id_vars='Object')
df3 = pd.concat([dfm[['Object', 'value']], dfm['variable'].str.split(expand=True)], axis=1)
df3.rename(columns={0: 'Month', 1: 'Year', 2:'Type'}, inplace=True)
df3 = df3.set_index(['Object', 'Year', 'Month', 'Type']).unstack()['value'].reset_index()
df3['Year'] = df3['Year'].astype(int)+2000
df3['Month'] = pd.to_datetime(df3['Month'], format='%b').dt.month
#Output
#Type Object Year Month Price Sales Vol
#0 Apple 2001 2 1.15 5.75 5.0
#1 Apple 2001 1 1.15 0.00 0.0
#2 Apple 2002 1 1.15 1.15 1.0
#3 Banana 2001 2 1.75 5.25 3.0
#4 Banana 2001 1 1.75 5.25 2.0
#5 Banana 2002 1 1.75 5.25 3.0
#6 Cherry 2001 2 2.50 2.50 1.0
#7 Cherry 2001 1 2.50 20.00 8.0
#8 Cherry 2002 1 2.50 12.50 5.0
#9 Grape 2001 2 3.40 6.80 2.0
#10 Grape 2001 1 3.40 13.60 4.0
#11 Grape 2002 1 3.40 13.60 4.0
#12 Orange 2001 2 2.30 9.20 4.0
#13 Orange 2001 1 2.30 11.50 5.0
#14 Orange 2002 1 2.30 4.60 2.0
I would first use pd.melt
to do the transformations. Use .str.split
with expand=True
to split the information in the column variable
(which was constructed by pd.melt
from the columns) into three separate columns and rename them to something meaningful. Then use set_index
so we can unstack
, which spreads the information over three columns from the long form into wide-form, as desired. Lastly, alter the datetime characteristics into the numbers you desired.
Hope that helps
Upvotes: 3
Reputation: 153460
You can use pd.wide_to_long
with some column renaming and pd.to_datetime
with .dt accessor to get year
and month
attributes:
df = df.set_index('Object')
df.columns = df.columns.str.replace(r'(.+) ([Vol|Price|Sales]+)',r'\2_\1')
df_out = pd.wide_to_long(df.reset_index(),['Vol','Price','Sales'],'Object','Months','_','.+')
df_out = df_out.reset_index()
df_out['Months'] = pd.to_datetime(df_out['Months'], format='%b %y')
df_out['Year'] = df_out['Months'].dt.year
df_out['Month'] = df_out['Months'].dt.month
df_out = df_out.drop('Months', axis=1).sort_values(['Object','Year'])
print(df_out)
Output:
Object Vol Price Sales Year Month
0 Apple 0 1.15 0.00 2001 1
10 Apple 5 1.15 5.75 2001 2
5 Apple 1 1.15 1.15 2002 1
2 Banana 2 1.75 5.25 2001 1
12 Banana 3 1.75 5.25 2001 2
7 Banana 3 1.75 5.25 2002 1
4 Cherry 8 2.50 20.00 2001 1
14 Cherry 1 2.50 2.50 2001 2
9 Cherry 5 2.50 12.50 2002 1
3 Grape 4 3.40 13.60 2001 1
13 Grape 2 3.40 6.80 2001 2
8 Grape 4 3.40 13.60 2002 1
1 Orange 5 2.30 11.50 2001 1
11 Orange 4 2.30 9.20 2001 2
6 Orange 2 2.30 4.60 2002 1
Upvotes: 2