Reputation: 4100
I have a df like this:
Country Industry 2011_0-9_AF 2011_0-9_AP
US AB 0 0
US AC 12.34 12.4
UK AB 1 2
UK AC 12 5
So, in my original dataframe I have 3
countries for every country I have 4
industries and I have 1120
columns like 2011_0-9_AF etc.
I need to transform the df like this:
Country Industry Year Group_Type Tags Value
US AB 2011 0-9 AF 0
US AB 2011 0-9 AP 0
US AC 2011 0-9 AF 12.34
US AC 2011 0-9 AP 12.4
And similarly for UK and other countries. So, I want columns to be split into 4, the value from starting to 1st underscore as Year
, then Group_Type
, then Tags
and then the value of it in Value
column
I am able to create the same in PowerBI but since it has 1120 columns, it has already taken more than 2 hours and still running and I have 5 files like this.
Looking for a solution which can be faster in Python?
Upvotes: 2
Views: 827
Reputation: 28669
The reshaping process could be abstracted by using the pivot_longer function from pyjanitor; at the moment you have to install the latest development version from github:
# install latest dev version
# pip install git+https://github.com/ericmjl/pyjanitor.git
import janitor
df.pivot_longer(
index=["Country", "Industry"],
names_to=("Year", "Group_Type", "Tags"),
names_sep="_",
values_to="Value",
)
Country Industry Year Group_Type Tags Value
0 US AB 2011 0-9 AF 0.00
1 US AC 2011 0-9 AF 12.34
2 UK AB 2011 0-9 AF 1.00
3 UK AC 2011 0-9 AF 12.00
4 US AB 2011 0-9 AP 0.00
5 US AC 2011 0-9 AP 12.40
6 UK AB 2011 0-9 AP 2.00
7 UK AC 2011 0-9 AP 5.00
The code above splits the columns based on the _
character and assigns the outputs to the new column names in names_to
Upvotes: 1
Reputation: 323276
Try stack
and pd.MultiIndex
df = df.set_index(['Country','Industry'])
df.columns=pd.MultiIndex.from_tuples(df.columns.str.split('_').map(tuple))
out = df.stack(level=[0,1,2]).reset_index()
out
Out[56]:
Country Industry level_2 level_3 level_4 0
0 US AB 2011 0-9 AF 0.00
1 US AB 2011 0-9 AP 0.00
2 US AC 2011 0-9 AF 12.34
3 US AC 2011 0-9 AP 12.40
4 UK AB 2011 0-9 AF 1.00
5 UK AB 2011 0-9 AP 2.00
6 UK AC 2011 0-9 AF 12.00
7 UK AC 2011 0-9 AP 5.00
Upvotes: 1
Reputation: 1116
You should check https://pandas.pydata.org/docs/reference/api/pandas.melt.html
import pandas as pd
import io
df = pd.read_csv(io.StringIO("""Country Industry 2011_0-9_AF 2011_0-9_AP
US AB 0 0
US AC 12.34 12.4
UK AB 1 2
UK AC 12 5"""), sep='\s+')
# First melt (unpivot) the columns
df_melted = df.melt(id_vars=['Country', 'Industry'], var_name='Var', value_name='Value')
# Split "Var" column
df_2 = df_melted.Var.str.split('_',expand=True)
df_2.columns = ['Year', 'Group_Type' ,'Tags']
final_df = pd.concat([df_melted.drop(columns=['Var']), df_2], axis=1)
print(final_df)
>>>
Country Industry Value Year Group_Type Tags
0 US AB 0.00 2011 0-9 AF
1 US AC 12.34 2011 0-9 AF
2 UK AB 1.00 2011 0-9 AF
3 UK AC 12.00 2011 0-9 AF
4 US AB 0.00 2011 0-9 AP
5 US AC 12.40 2011 0-9 AP
6 UK AB 2.00 2011 0-9 AP
7 UK AC 5.00 2011 0-9 AP
Upvotes: 2
Reputation: 214967
You can try melt
it and then split
the variable column by _
:
long_df = pd.melt(df, id_vars=['Country', 'Industry'])
long_df[['Year', 'Group_Type', 'Tags']] = long_df.variable.str.split('_', expand=True)
long_df.drop('variable', axis=1)
# Country Industry value Year Group_Type Tags
#0 US AB 0.00 2011 0-9 AF
#1 US AC 12.34 2011 0-9 AF
#2 UK AB 1.00 2011 0-9 AF
#3 UK AC 12.00 2011 0-9 AF
#4 US AB 0.00 2011 0-9 AP
#5 US AC 12.40 2011 0-9 AP
#6 UK AB 2.00 2011 0-9 AP
#7 UK AC 5.00 2011 0-9 AP
Upvotes: 7