Reputation: 73
I have a dataset in the following format:
county area pop_2006 pop_2007 pop_2008
01001 275 1037 1052 1102
01003 394 2399 2424 2438
01005 312 1638 1647 1660
And I need it in a format like this:
county year pop area
01001 2006 1037 275
01001 2007 1052 275
01001 2008 1102 275
01003 2006 2399 394
01003 2007 2424 394
...
I've tried every combination of pivot_table, stack, unstack, wide_to_long that I can think of, with no success yet. (clearly I'm mostly illiterate in Python/pandas, so please be gentle...).
Upvotes: 6
Views: 7303
Reputation: 2142
Reshape df using pivot_longer.
import pandas as pd
import janitor
# Create the DataFrame
data = {
"county": ["01001", "01003", "01005"],
"area": [275, 394, 312],
"pop_2006": [1037, 2399, 1638],
"pop_2007": [1052, 2424, 1647],
"pop_2008": [1102, 2438, 1660],
}
df = pd.DataFrame(data)
"""print(df)
county area pop_2006 pop_2007 pop_2008
0 01001 275 1037 1052 1102
1 01003 394 2399 2424 2438
2 01005 312 1638 1647 1660"""
# Reshape using pivot_longer
result = (
df.pivot_longer(
index=["county", "area"],
names_to="year",
names_pattern="pop_(.*)",
values_to="pop",
sort_by_appearance=True
)
)
print(result)
"""print(result)
county area year pop
0 01001 275 2006 1037
1 01001 275 2007 1052
2 01001 275 2008 1102
3 01003 394 2006 2399
4 01003 394 2007 2424
5 01003 394 2008 2438
6 01005 312 2006 1638
7 01005 312 2007 1647
8 01005 312 2008 1660"""
Upvotes: 0
Reputation: 28709
One option is with pivot_longer from pyjanitor
# pip install pyjanitor
import pandas as pd
import janitor
(df
.pivot_longer(
index = ['county', 'area'],
names_to = ('.value', 'year'),
names_sep = '_',
sort_by_appearance=True)
)
county area year pop
0 1001 275 2006 1037
1 1001 275 2007 1052
2 1001 275 2008 1102
3 1003 394 2006 2399
4 1003 394 2007 2424
5 1003 394 2008 2438
6 1005 312 2006 1638
7 1005 312 2007 1647
8 1005 312 2008 1660
For this particular reshape, any part of the columns associated with .value
remain as column headers, while the rest are transposed into columns. you can also change the dtype of the transposed columns (this can be efficient, especially for large data sizes):
(df
.pivot_longer(
index = ['county', 'area'],
names_to = ('.value', 'year'),
names_sep = '_',
names_transform = {'year':int},
sort_by_appearance=True)
)
county area year pop
0 1001 275 2006 1037
1 1001 275 2007 1052
2 1001 275 2008 1102
3 1003 394 2006 2399
4 1003 394 2007 2424
5 1003 394 2008 2438
6 1005 312 2006 1638
7 1005 312 2007 1647
8 1005 312 2008 1660
Upvotes: 1
Reputation: 18315
As the question title suggests, we can use pd.wide_to_long
:
res = pd.wide_to_long(df, stubnames="pop", i=["county", "area"], j="year", sep="_")
to get
pop
county area year
1001 275 2006 1037
2007 1052
2008 1102
1003 394 2006 2399
2007 2424
2008 2438
1005 312 2006 1638
2007 1647
2008 1660
To exactly match the output format in the question, a reset_index
and reindex
(over columns) can be chained:
>>> res.reset_index().reindex(["county", "year", "pop", "area"], axis=1)
county year pop area
0 1001 2006 1037 275
1 1001 2007 1052 275
2 1001 2008 1102 275
3 1003 2006 2399 394
4 1003 2007 2424 394
5 1003 2008 2438 394
6 1005 2006 1638 312
7 1005 2007 1647 312
8 1005 2008 1660 312
Upvotes: 4
Reputation: 863166
You can use melt
for reshaping, then split
column variable
and drop
and sort_values
. I think you can cast column year
to int
by astype
and last change order of columns by subset
:
df1 = (pd.melt(df, id_vars=['county','area'], value_name='pop'))
df1[['tmp','year']] = df1.variable.str.split('_', expand=True)
df1 = df1.drop(['variable', 'tmp'],axis=1).sort_values(['county','year'])
df1['year'] = df1.year.astype(int)
df1 = df1[['county','year','pop','area']]
print (df1)
county year pop area
0 1001 2006 1037 275
3 1001 2007 1052 275
6 1001 2008 1102 275
1 1003 2006 2399 394
4 1003 2007 2424 394
7 1003 2008 2438 394
2 1005 2006 1638 312
5 1005 2007 1647 312
8 1005 2008 1660 312
print (df1.dtypes)
county int64
year int32
pop int64
area int64
dtype: object
Another solution with set_index
, stack
and reset_index
:
df2 = df.set_index(['county','area']).stack().reset_index(name='pop')
df2[['tmp','year']] = df2.level_2.str.split('_', expand=True)
df2 = df2.drop(['level_2', 'tmp'],axis=1)
df2['year'] = df2.year.astype(int)
df2 = df2[['county','year','pop','area']]
print (df2)
county year pop area
0 1001 2006 1037 275
1 1001 2007 1052 275
2 1001 2008 1102 275
3 1003 2006 2399 394
4 1003 2007 2424 394
5 1003 2008 2438 394
6 1005 2006 1638 312
7 1005 2007 1647 312
8 1005 2008 1660 312
print (df2.dtypes)
county int64
year int32
pop int64
area int64
dtype: object
Upvotes: 4