Reputation: 2293
I have this data that I want to unpivot and melt into columns. The data is a multi-header table. I have a sample dictionary of the data.
Edit here___
I don't know how to convert a dictionary with multiple keys like I had shown previously into a df so let's restructure the dictionary like so...
data = {
"id": {
0: "month",
1: "11/30/2021",
2: "12/31/2021",
3: "1/31/2022",
4: "2/28/2022",
5: "3/31/2022",
},
"A48": {0: "storage", 1: "0", 2: "29", 3: "35", 4: "33", 5: "30"},
"A48.1": {0: "use", 1: "0", 2: "1", 3: "0", 4: "0", 5: "0"},
"A62": {0: "direct", 1: "0", 2: "0", 3: "2", 4: "3", 5: "2"},
"A62.1": {0: "storage", 1: "0", 2: "57", 3: "69", 4: "65", 5: "59"},
"A62.2": {0: "use", 1: "0", 2: "1", 3: "0", 4: "0", 5: "0"},
}
Now let's get the Dataframe...
dfc = pd.DataFrame.from_dict(data)
dfc.columns=pd.MultiIndex.from_arrays([dfc.columns,dfc.iloc[0]])
dfc = dfc.iloc[1:].reset_index(drop=True)
Which looks like this:
id A48 A48.1 A62 A62.1 A62.2
month storage use direct storage use
0 11/30/2021 0 0 0 0 0
1 12/31/2021 29 1 0 57 1
2 1/31/2022 35 0 2 69 0
3 2/28/2022 33 0 3 65 0
4 3/31/2022 30 0 2 59 0
What I am looking for is a table like this.
month | id | direct | storage | use |
---|---|---|---|---|
11/30/2021 | A48 | NaN | 0 | 0 |
12/31/2021 | A48 | NaN | 29 | 1 |
1/31/2022 | A48 | NaN | 35 | 0 |
2/28/2022 | A48 | NaN | 33 | 0 |
3/31/2022 | A48 | NaN | 30 | 0 |
11/30/2021 | A62 | 0 | 0 | 0 |
12/31/2021 | A62 | 0 | 57 | 1 |
1/31/2022 | A62 | 2 | 69 | 0 |
2/28/2022 | A62 | 3 | 65 | 0 |
3/31/2022 | A62 | 2 | 59 | 0 |
Upvotes: 2
Views: 444
Reputation: 28644
One option is to flatten the columns, then reshape with pivot_longer from pyjanitor, using a regular expression to capture the groups:
# pip install pyjanitor
import pandas as pd
import janitor
# flatten columns:
# an alternative is dfc.collapse_levels()
# from pyjanitor
dfc.columns = dfc.columns.map('_'.join)
# reshape:
(dfc
.pivot_longer(
index = 'id_month',
names_to = ('id', '.value'),
names_pattern = r"([a-zA-Z]\d+)\.?\d?_(.+)")
.rename(columns={'id_month':'month'})
)
month id storage use direct
0 11/30/2021 A48 0 0 NaN
1 12/31/2021 A48 29 1 NaN
2 1/31/2022 A48 35 0 NaN
3 2/28/2022 A48 33 0 NaN
4 3/31/2022 A48 30 0 NaN
5 11/30/2021 A62 0 0 0
6 12/31/2021 A62 57 1 0
7 1/31/2022 A62 69 0 2
8 2/28/2022 A62 65 0 3
9 3/31/2022 A62 59 0 2
Another option is to rebuild the MultiIndex column, and then stack
:
dfc.columns.names = ['top','bottom']
top = dfc.columns.get_level_values(level='top').str.split('.').str[0]
bottom = dfc.columns.get_level_values('bottom')
cols = pd.MultiIndex.from_arrays([top, bottom], names = ['top','bottom'])
dfc.columns = cols
(dfc
.set_index(('id','month'))
.stack('top')
.rename_axis(index=['month','id'], columns=None)
.reset_index()
)
month id direct storage use
0 11/30/2021 A48 NaN 0 0
1 11/30/2021 A62 0 0 0
2 12/31/2021 A48 NaN 29 1
3 12/31/2021 A62 0 57 1
4 1/31/2022 A48 NaN 35 0
5 1/31/2022 A62 2 69 0
6 2/28/2022 A48 NaN 33 0
7 2/28/2022 A62 3 65 0
8 3/31/2022 A48 NaN 30 0
9 3/31/2022 A62 2 59 0
Upvotes: 1
Reputation: 13458
Define for later use the following helper function:
import pandas as pd
def helper(df):
return df.pipe(
lambda df_: df_.rename(columns={"col1": df_["col0"].unique()[0]})
.drop(columns="col0")
.reset_index(drop=True)
)
Then, with Pandas melt, concat and merge methods:
# Setup
n = dfc.shape[0]
# Melt dataframe and cleanup
melted_dfc = dfc.melt()
melted_dfc.columns = ["id", "col0", "col1"]
melted_dfc["id"] = melted_dfc["id"].replace(r"[.]\d+", "", regex=True)
# Get intermediate dataframes
month_df = helper(melted_dfc.loc[: n - 1, :]).drop(columns="id")
sub_dfs = [
pd.concat([month_df, helper(df)], axis=1)
for df in [
melted_dfc.loc[i : i + n - 1, :] for i in range(n, melted_dfc.shape[0], n)
]
]
# Merge intermediate dataframes
final_df = sub_dfs[0]
for sub_df in sub_dfs[1:]:
final_df = pd.merge(
left=final_df, right=sub_df, how="outer", on=["month", "id"]
).fillna(0)
# Cleanup temporary columns created during merge
columns_to_merge = set(
col[:-2] for col in final_df.columns if col.endswith(("_x", "_y"))
)
for col in columns_to_merge:
final_df[col] = final_df[f"{col}_x"].astype(int) + final_df[f"{col}_y"].astype(int)
final_df = final_df.drop(columns=[f"{col}_x", f"{col}_y"])
Finally:
print(final_df)
# Output
month id direct storage use
0 12/31/2021 A48 0 29 1
1 1/31/2022 A48 0 35 0
2 2/28/2022 A48 0 33 0
3 3/31/2022 A48 0 30 0
4 12/31/2021 A62 0 57 1
5 1/31/2022 A62 2 69 0
6 2/28/2022 A62 3 65 0
7 3/31/2022 A62 2 59 0
Upvotes: 1