Rahul Agarwal
Rahul Agarwal

Reputation: 4100

Unpivot df columns to multiple columns and rows

I have a df like this:

Country  Industry   2011_0-9_AF    2011_0-9_AP
US        AB            0               0
US        AC           12.34           12.4
UK        AB            1               2
UK        AC            12              5

So, in my original dataframe I have 3 countries for every country I have 4 industries and I have 1120 columns like 2011_0-9_AF etc.

I need to transform the df like this:

Country  Industry   Year   Group_Type    Tags    Value
US        AB         2011   0-9          AF      0
US        AB         2011   0-9          AP      0
US        AC         2011   0-9          AF      12.34
US        AC         2011   0-9          AP      12.4

And similarly for UK and other countries. So, I want columns to be split into 4, the value from starting to 1st underscore as Year, then Group_Type, then Tags and then the value of it in Value column

I am able to create the same in PowerBI but since it has 1120 columns, it has already taken more than 2 hours and still running and I have 5 files like this.

Looking for a solution which can be faster in Python?

Upvotes: 2

Views: 827

Answers (4)

sammywemmy
sammywemmy

Reputation: 28669

The reshaping process could be abstracted by using the pivot_longer function from pyjanitor; at the moment you have to install the latest development version from github:

 # install latest dev version
# pip install git+https://github.com/ericmjl/pyjanitor.git
 import janitor

df.pivot_longer(
    index=["Country", "Industry"],
    names_to=("Year", "Group_Type", "Tags"),
    names_sep="_",
    values_to="Value",
)

      Country     Industry  Year    Group_Type  Tags    Value
0        US          AB     2011        0-9     AF      0.00
1        US          AC     2011        0-9     AF      12.34
2        UK          AB     2011        0-9     AF      1.00
3        UK          AC     2011        0-9     AF      12.00
4        US          AB     2011        0-9     AP      0.00
5        US          AC     2011        0-9     AP      12.40
6        UK          AB     2011        0-9     AP      2.00
7        UK          AC     2011        0-9     AP      5.00

The code above splits the columns based on the _ character and assigns the outputs to the new column names in names_to

Upvotes: 1

BENY
BENY

Reputation: 323276

Try stack and pd.MultiIndex

df = df.set_index(['Country','Industry'])
df.columns=pd.MultiIndex.from_tuples(df.columns.str.split('_').map(tuple))
out = df.stack(level=[0,1,2]).reset_index()

out
Out[56]: 
  Country Industry level_2 level_3 level_4      0
0      US       AB    2011     0-9      AF   0.00
1      US       AB    2011     0-9      AP   0.00
2      US       AC    2011     0-9      AF  12.34
3      US       AC    2011     0-9      AP  12.40
4      UK       AB    2011     0-9      AF   1.00
5      UK       AB    2011     0-9      AP   2.00
6      UK       AC    2011     0-9      AF  12.00
7      UK       AC    2011     0-9      AP   5.00

Upvotes: 1

armamut
armamut

Reputation: 1116

You should check https://pandas.pydata.org/docs/reference/api/pandas.melt.html

import pandas as pd
import io

df = pd.read_csv(io.StringIO("""Country  Industry   2011_0-9_AF    2011_0-9_AP
US        AB            0               0
US        AC           12.34           12.4
UK        AB            1               2
UK        AC            12              5"""), sep='\s+')

# First melt (unpivot) the columns
df_melted = df.melt(id_vars=['Country', 'Industry'], var_name='Var', value_name='Value')

# Split "Var" column
df_2 = df_melted.Var.str.split('_',expand=True)
df_2.columns = ['Year', 'Group_Type' ,'Tags']
final_df = pd.concat([df_melted.drop(columns=['Var']), df_2], axis=1)

print(final_df)
>>>
  Country Industry  Value  Year Group_Type Tags
0      US       AB   0.00  2011        0-9   AF
1      US       AC  12.34  2011        0-9   AF
2      UK       AB   1.00  2011        0-9   AF
3      UK       AC  12.00  2011        0-9   AF
4      US       AB   0.00  2011        0-9   AP
5      US       AC  12.40  2011        0-9   AP
6      UK       AB   2.00  2011        0-9   AP
7      UK       AC   5.00  2011        0-9   AP

Upvotes: 2

akuiper
akuiper

Reputation: 214967

You can try melt it and then split the variable column by _:

long_df = pd.melt(df, id_vars=['Country', 'Industry'])
long_df[['Year', 'Group_Type', 'Tags']] = long_df.variable.str.split('_', expand=True)

long_df.drop('variable', axis=1)
#  Country Industry  value  Year Group_Type Tags
#0      US       AB   0.00  2011        0-9   AF
#1      US       AC  12.34  2011        0-9   AF
#2      UK       AB   1.00  2011        0-9   AF
#3      UK       AC  12.00  2011        0-9   AF
#4      US       AB   0.00  2011        0-9   AP
#5      US       AC  12.40  2011        0-9   AP
#6      UK       AB   2.00  2011        0-9   AP
#7      UK       AC   5.00  2011        0-9   AP

Upvotes: 7

Related Questions