Reputation: 1351
Given the following dataframe definition,
data = pd.DataFrame({"sec":["dim1", "dim2", "dim3"], # Construct a pandas DataFrame
"m1vc_15":[1,2,3],
"m1vu_15":[10,11,12],
"m2vc_15":[20,21,22],
"m2vu_15":[30,31,32],
"m1vc_16":[4,5,6],
"m1vu_16":[13,14,15],
"m2vc_16":[23,24,25],
"m2vu_16":[33,34,35]})
which generates this table structure:
sec m1vc_15 m1vu_15 m2vc_15 m2vu_15 m1vc_16 m1vu_16 m2vc_16 m2vu_16
dim1 1 10 20 30 4 13 23 33
dim2 2 11 21 31 5 14 24 34
dim3 3 12 22 32 6 15 25 35
Is there any function or package that allows me to create this expected outcome?
data2 = pd.DataFrame({"sec":np.repeat(["dim1", "dim2", "dim3", "dim1", "dim2", "dim3"],2), # Construct a pandas DataFrame
"type":np.repeat(['vc','vu'],6),
"year":["2015", "2016", "2015", "2016", "2015", "2016",
"2015", "2016", "2015", "2016", "2015", "2016"],
"m1":[1, 4, 2, 5, 3, 6, 10, 13, 11, 14, 12, 15],
"m2":[20, 23, 21, 24, 22, 25, 30, 33, 31, 34, 32, 35]})
"""
# Output
sec type year m1 m2
dim1 vc 2015 1 20
dim1 vc 2016 4 23
dim2 vc 2015 2 21
dim2 vc 2016 5 24
dim3 vc 2015 3 22
dim3 vc 2016 6 25
dim1 vu 2015 10 30
dim1 vu 2016 13 33
dim2 vu 2015 11 31
dim2 vu 2016 14 34
dim3 vu 2015 12 32
dim3 vu 2016 15 35
"""
The logic for the transpose is:
_integer
. The integer is the last two digits of the year.vu_
or vc_
. This string differentiates different types of objects. There will be no other value for type in the data source.Upvotes: 2
Views: 525
Reputation: 28699
One efficient option is with pivot_longer from pyjanitor:
# pip install pyjanitor
import pandas as pd
import janitor
(data
.pivot_longer(
index = 'sec',
names_to = ('.value', 'type', 'year'),
names_pattern = r"(.\d)(.+)_(\d+)")
.assign(year = lambda df: ("20"+df.year).astype(int))
)
sec type year m1 m2
0 dim1 vc 2015 1 20
1 dim2 vc 2015 2 21
2 dim3 vc 2015 3 22
3 dim1 vu 2015 10 30
4 dim2 vu 2015 11 31
5 dim3 vu 2015 12 32
6 dim1 vc 2016 4 23
7 dim2 vc 2016 5 24
8 dim3 vc 2016 6 25
9 dim1 vu 2016 13 33
10 dim2 vu 2016 14 34
11 dim3 vu 2016 15 35
The .value
determines which columns remain as header, which in this case is m1/m2
; the rest are lumped into the type
and year
columns
A faster option would be to convert the year
column into an integer within pivot_longer
, by using the names_transform
option:
(data
.pivot_longer(
index = 'sec',
names_to = ('.value', 'type', 'year'),
names_pattern = r"(.\d)(.+)_(\d+)",
names_transform = {'year': int})
.assign(year = lambda df: df.year + 2000)
)
sec type year m1 m2
0 dim1 vc 2015 1 20
1 dim2 vc 2015 2 21
2 dim3 vc 2015 3 22
3 dim1 vu 2015 10 30
4 dim2 vu 2015 11 31
5 dim3 vu 2015 12 32
6 dim1 vc 2016 4 23
7 dim2 vc 2016 5 24
8 dim3 vc 2016 6 25
9 dim1 vu 2016 13 33
10 dim2 vu 2016 14 34
11 dim3 vu 2016 15 35
Upvotes: 1
Reputation: 214987
You can use wide_to_long
with a few more steps to clean up the columns:
df = pd.wide_to_long(data, stubnames=['m1', 'm2'], i=['sec'], j='type_year', suffix='\w+').reset_index()
df[['type', 'year']] = df.type_year.str.split('_', expand=True)
df.year = '20' + df.year
df.drop('type_year', axis=1, inplace=True)
df
sec m1 m2 type year
0 dim1 1 20 vc 2015
1 dim2 2 21 vc 2015
2 dim3 3 22 vc 2015
3 dim1 10 30 vu 2015
4 dim2 11 31 vu 2015
5 dim3 12 32 vu 2015
6 dim1 4 23 vc 2016
7 dim2 5 24 vc 2016
8 dim3 6 25 vc 2016
9 dim1 13 33 vu 2016
10 dim2 14 34 vu 2016
11 dim3 15 35 vu 2016
Upvotes: 2