Reputation: 73

Wide to long data transform in pandas

I have a dataset in the following format:

county   area    pop_2006    pop_2007    pop_2008
01001    275      1037         1052        1102
01003    394      2399         2424        2438
01005    312      1638         1647        1660

And I need it in a format like this:

county    year   pop      area
01001     2006   1037      275
01001     2007   1052      275
01001     2008   1102      275
01003     2006   2399      394
01003     2007   2424      394
...

I've tried every combination of pivot_table, stack, unstack, wide_to_long that I can think of, with no success yet. (clearly I'm mostly illiterate in Python/pandas, so please be gentle...).

Upvotes: 6

Answers (4)

Soudipta Dutta

Reputation: 2142

Reshape df using pivot_longer.

import pandas as pd
import janitor

# Create the DataFrame
data = {
    "county": ["01001", "01003", "01005"],
    "area": [275, 394, 312],
    "pop_2006": [1037, 2399, 1638],
    "pop_2007": [1052, 2424, 1647],
    "pop_2008": [1102, 2438, 1660],
}
df = pd.DataFrame(data)
"""print(df)
  county  area  pop_2006  pop_2007  pop_2008
0  01001   275      1037      1052      1102
1  01003   394      2399      2424      2438
2  01005   312      1638      1647      1660"""

# Reshape using pivot_longer
result = (
    df.pivot_longer(
        index=["county", "area"],
        names_to="year",
        names_pattern="pop_(.*)",
        values_to="pop",
        sort_by_appearance=True
    )
  
)
print(result)

"""print(result)
  county  area  year   pop
0  01001   275  2006  1037
1  01001   275  2007  1052
2  01001   275  2008  1102
3  01003   394  2006  2399
4  01003   394  2007  2424
5  01003   394  2008  2438
6  01005   312  2006  1638
7  01005   312  2007  1647
8  01005   312  2008  1660"""

Upvotes: 0

sammywemmy

Reputation: 28709

One option is with pivot_longer from pyjanitor

# pip install pyjanitor
import pandas as pd
import janitor

(df
.pivot_longer(
    index = ['county', 'area'], 
    names_to = ('.value', 'year'), 
    names_sep = '_', 
    sort_by_appearance=True)
)
   county  area  year   pop
0    1001   275  2006  1037
1    1001   275  2007  1052
2    1001   275  2008  1102
3    1003   394  2006  2399
4    1003   394  2007  2424
5    1003   394  2008  2438
6    1005   312  2006  1638
7    1005   312  2007  1647
8    1005   312  2008  1660

For this particular reshape, any part of the columns associated with .value remain as column headers, while the rest are transposed into columns. you can also change the dtype of the transposed columns (this can be efficient, especially for large data sizes):

(df
.pivot_longer(
    index = ['county', 'area'], 
    names_to = ('.value', 'year'), 
    names_sep = '_', 
    names_transform = {'year':int},
    sort_by_appearance=True)
)
   county  area  year   pop
0    1001   275  2006  1037
1    1001   275  2007  1052
2    1001   275  2008  1102
3    1003   394  2006  2399
4    1003   394  2007  2424
5    1003   394  2008  2438
6    1005   312  2006  1638
7    1005   312  2007  1647
8    1005   312  2008  1660

Upvotes: 1

Mustafa Aydın

Reputation: 18315

As the question title suggests, we can use pd.wide_to_long:

res = pd.wide_to_long(df, stubnames="pop", i=["county", "area"], j="year", sep="_")

to get

                   pop
county area year
1001   275  2006  1037
            2007  1052
            2008  1102
1003   394  2006  2399
            2007  2424
            2008  2438
1005   312  2006  1638
            2007  1647
            2008  1660

To exactly match the output format in the question, a reset_index and reindex (over columns) can be chained:

>>> res.reset_index().reindex(["county", "year", "pop", "area"], axis=1)

   county  year   pop  area
0    1001  2006  1037   275
1    1001  2007  1052   275
2    1001  2008  1102   275
3    1003  2006  2399   394
4    1003  2007  2424   394
5    1003  2008  2438   394
6    1005  2006  1638   312
7    1005  2007  1647   312
8    1005  2008  1660   312

Upvotes: 4

jezrael

Reputation: 863166

You can use melt for reshaping, then split column variable and drop and sort_values. I think you can cast column year to int by astype and last change order of columns by subset:

df1 = (pd.melt(df, id_vars=['county','area'], value_name='pop'))
df1[['tmp','year']] = df1.variable.str.split('_', expand=True)
df1 = df1.drop(['variable', 'tmp'],axis=1).sort_values(['county','year'])
df1['year'] = df1.year.astype(int)
df1 = df1[['county','year','pop','area']]
print (df1)
   county  year   pop  area
0    1001  2006  1037   275
3    1001  2007  1052   275
6    1001  2008  1102   275
1    1003  2006  2399   394
4    1003  2007  2424   394
7    1003  2008  2438   394
2    1005  2006  1638   312
5    1005  2007  1647   312
8    1005  2008  1660   312

print (df1.dtypes)
county    int64
year      int32
pop       int64
area      int64
dtype: object

Another solution with set_index, stack and reset_index:

df2 = df.set_index(['county','area']).stack().reset_index(name='pop')
df2[['tmp','year']] = df2.level_2.str.split('_', expand=True)
df2 = df2.drop(['level_2', 'tmp'],axis=1)
df2['year'] = df2.year.astype(int)
df2 = df2[['county','year','pop','area']]

print (df2)
   county  year   pop  area
0    1001  2006  1037   275
1    1001  2007  1052   275
2    1001  2008  1102   275
3    1003  2006  2399   394
4    1003  2007  2424   394
5    1003  2008  2438   394
6    1005  2006  1638   312
7    1005  2007  1647   312
8    1005  2008  1660   312

print (df2.dtypes)
county    int64
year      int32
pop       int64
area      int64
dtype: object

Upvotes: 4

Wide to long data transform in pandas

Answers (4)

Related Questions