Reputation: 67
I am trying to stack financial values year wise in pandas Dataframe. But not able to get started.
All I have tried is
df1 = df.set_index(['refnum','y1gp','y2gp','y3gp']).stack()\
.reset_index(name='REV').rename(columns={'level_5':'Year'})
Existing:
refnum | y1 | y1rev | y1gp | y2 | y2rev | y2gp | y3 | y3rev | y3gp |
---|---|---|---|---|---|---|---|---|---|
10001 | 2021 | 300 | 200 | 2022 | 100 | 600 | 2023 | 300 | 300 |
10002 | 2020 | 300 | 200 | 2021 | 200 | 500 | 2022 | 300 | 300 |
10003 | 2021 | 300 | 200 | 2022 | 500 | 500 | 2023 | 300 | 300 |
Expected:
refnum | year | REV | GP | Base Year |
---|---|---|---|---|
10001 | 2021 | 300 | 200 | BaseYear |
10001 | 2022 | 100 | 600 | BaseYear+1 |
10001 | 2023 | 300 | 300 | BaseYear+2 |
10002 | 2020 | 300 | 200 | BaseYear |
10002 | 2021 | 200 | 500 | BaseYear+1 |
10002 | 2022 | 300 | 300 | BaseYear+2 |
10003 | 2021 | 300 | 200 | BaseYear |
10003 | 2022 | 500 | 500 | BaseYear+1 |
10003 | 2023 | 300 | 300 | BaseYear+2 |
Upvotes: 1
Views: 798
Reputation: 28644
You could use pivot_longer
from pyjanitor
; for this scenario, you pass regexes to names_pattern
, and the new column names in names_to
:
# pip install pyjanitor
import janitor
import pandas as pd
df.pivot_longer(index='refnum',
names_to=['year', 'REV', 'GP'],
names_pattern=['^y\d$', '.*rev$', '.*gp$']
)
refnum year REV GP
0 10001 2021 300 200
1 10002 2020 300 200
2 10003 2021 300 200
3 10001 2022 100 600
4 10002 2021 200 500
5 10003 2022 500 500
6 10001 2023 300 300
7 10002 2022 300 300
8 10003 2023 300 300
If you want the base year included, you can modify the column labels that end with a number, before using pivot_longer
:
(df.rename(columns = lambda col: f"{col}YEAR"
if col.endswith(('1','2','3'))
else col)
.pivot_longer(index='refnum',
names_to= ("Base Year", ".value"),
names_pattern=r".(\d)(.+)",
sort_by_appearance=True)
)
refnum Base Year YEAR rev gp
0 10001 1 2021 300 200
1 10001 2 2022 100 600
2 10001 3 2023 300 300
3 10002 1 2020 300 200
4 10002 2 2021 200 500
5 10002 3 2022 300 300
6 10003 1 2021 300 200
7 10003 2 2022 500 500
8 10003 3 2023 300 300
labels associated with .value
stay as column headers, while the rest are lumped into a new column (base year
)
Upvotes: 2
Reputation: 195418
Try:
df.columns = [re.sub(r"y(\d+)(.*)", r"\2-\1", c) for c in df.columns]
x = (
pd.wide_to_long(
df, stubnames=["", "gp", "rev"], sep="-", i="refnum", j="Base Year"
)
.rename(columns={"": "year"})
.reset_index()
.sort_values(by="refnum")
)
print(x)
Prints:
refnum Base Year year gp rev
0 10001 1 2021 200 300
3 10001 2 2022 600 100
6 10001 3 2023 300 300
1 10002 1 2020 200 300
4 10002 2 2021 500 200
7 10002 3 2022 300 300
2 10003 1 2021 200 300
5 10003 2 2022 500 500
8 10003 3 2023 300 300
Upvotes: 4
Reputation: 35626
Let's convert the headers to a useable MultiIndex using str.replace
and str.split
then stack
to go from wide-form to long. Then groupby cumcount
to create the BaseYear column.
# Save Columns
df = df.set_index('refnum')
# Create a MultiIndex with Numbers at the end and split into multiple levels
df.columns = (
df.columns.str.replace(r'^(.*?)(\d+)(.*)$', r'\1\3/\2', regex=True)
.str.split('/', expand=True)
)
# Wide Format to Long + Rename Columns
df = df.stack().droplevel(-1).reset_index().rename(
columns={'y': 'Year', 'ygp': 'GP', 'yrev': 'REV'}
)
# Add Base Year Column
df['BaseYear'] = "BaseYear+" + df.groupby('refnum').cumcount().astype(str)
# df['BaseYear'] = df.groupby('refnum').cumcount() # (int version)
df
:
refnum Year GP REV BaseYear
0 10001 2021 200 300 BaseYear+0
1 10001 2022 600 100 BaseYear+1
2 10001 2023 300 300 BaseYear+2
3 10002 2020 200 300 BaseYear+0
4 10002 2021 500 200 BaseYear+1
5 10002 2022 300 300 BaseYear+2
6 10003 2021 200 300 BaseYear+0
7 10003 2022 500 500 BaseYear+1
8 10003 2023 300 300 BaseYear+2
Upvotes: 3