Reputation: 1456
I have 2 columns that correlate to supervisory hierarchy ids, the one is the parent and the other is the child. So imagine the parent would start with the CEO and then would have a child id such as the chief marketing officer or any of the CEOs direct reports. I'm trying to build out the entire sup org hierarchy based on these two columns.
Sup Org Name | superior org ID. | sup org ID
CEO | | 111
CPO | 111 | 222
CTO | 111 | 221
Engineering | 221 | 223
PM | 222 | 224
Backend Dev. | 223 | 228
My goal would be create the hierarchy across multiple columns so we can build out the whole hierarchy
Level 1. | Level 2 | Level 3 | Level 4
111 | 222 | 224
111 | 221 | 223
111 | 221 | 223 | 228
And so on.
I'd also like to see the above with the sup org name replaced for the id. Having it both ways is helpful for my data.
I've tried doing a lot of self-joins but there has to be a cleaner way...
Upvotes: 2
Views: 936
Reputation: 13458
With the following toy dataframe:
import pandas as pd
df = pd.DataFrame(
{
"name": {
0: "CEO",
1: "CPO",
2: "CTO",
3: "Engineering",
4: "PM",
5: "Backend_Dev",
6: "COO",
},
"ID": {0: 111, 1: 222, 2: 221, 3: 223, 4: 224, 5: 228, 6: 220},
"superior_ID": {0: "", 1: 111, 2: 111, 3: 221, 4: 222, 5: 223, 6: 111},
}
)
You could try this:
# Setup
df = df.sort_values(by=["superior_ID", "ID"]).reset_index(drop=True)
highest_ID = df.loc[df["superior_ID"] == "", "ID"].values[0]
number_of_paths = df["superior_ID"].value_counts()[highest_ID]
paths = {i + 1: [] for i in range(number_of_paths)}
# Iterate to find all paths (111/222/224 is one path)
for i, row in df.iterrows():
if row["superior_ID"] == highest_ID:
paths[i + 1].append(highest_ID)
paths[i + 1].append(row["ID"])
continue
for path in paths.values():
if row["superior_ID"] in path:
path.append(row["ID"])
# Create new df concatenating found paths as rows
new_df = pd.DataFrame()
for path in paths.values():
s = pd.Series(path)
new_df = pd.concat([new_df, s], axis=1)
# Transpose and cleanup
new_df = (
new_df
.T
.pipe(lambda x: x.iloc[x.isna().sum(axis=1).mul(-1).argsort()])
.fillna(999)
.astype(int)
.pipe(lambda x: x.set_axis([f"level_{i+1}" for i in x.columns], axis="columns"))
.replace(999, "")
.reset_index(drop=True)
)
And so:
print(new_df)
# Output
level_1 level_2 level_3 level_4
0 111 220
1 111 222 224
2 111 221 223 228
Upvotes: 1