Ted Mosby
Ted Mosby

Reputation: 1456

Flattening Hierarchy on parent and child id in python

I have 2 columns that correlate to supervisory hierarchy ids, the one is the parent and the other is the child. So imagine the parent would start with the CEO and then would have a child id such as the chief marketing officer or any of the CEOs direct reports. I'm trying to build out the entire sup org hierarchy based on these two columns.

Sup Org Name | superior org ID.  | sup org ID
CEO          |                   | 111
CPO          | 111               | 222
CTO          | 111               | 221 
Engineering  | 221               | 223
PM           | 222               | 224
Backend Dev. | 223               | 228

My goal would be create the hierarchy across multiple columns so we can build out the whole hierarchy

Level 1. | Level 2 | Level 3 | Level 4
111      | 222     | 224
111      | 221     | 223 
111      | 221     | 223     | 228

And so on.

I'd also like to see the above with the sup org name replaced for the id. Having it both ways is helpful for my data.

I've tried doing a lot of self-joins but there has to be a cleaner way...

Upvotes: 2

Views: 936

Answers (1)

Laurent
Laurent

Reputation: 13458

With the following toy dataframe:

import pandas as pd

df = pd.DataFrame(
    {
        "name": {
            0: "CEO",
            1: "CPO",
            2: "CTO",
            3: "Engineering",
            4: "PM",
            5: "Backend_Dev",
            6: "COO",
        },
        "ID": {0: 111, 1: 222, 2: 221, 3: 223, 4: 224, 5: 228, 6: 220},
        "superior_ID": {0: "", 1: 111, 2: 111, 3: 221, 4: 222, 5: 223, 6: 111},
    }
)

You could try this:

# Setup
df = df.sort_values(by=["superior_ID", "ID"]).reset_index(drop=True)
highest_ID = df.loc[df["superior_ID"] == "", "ID"].values[0]
number_of_paths = df["superior_ID"].value_counts()[highest_ID]
paths = {i + 1: [] for i in range(number_of_paths)}

# Iterate to find all paths (111/222/224 is one path)
for i, row in df.iterrows():
    if row["superior_ID"] == highest_ID:
        paths[i + 1].append(highest_ID)
        paths[i + 1].append(row["ID"])
        continue
    for path in paths.values():
        if row["superior_ID"] in path:
            path.append(row["ID"])

# Create new df concatenating found paths as rows
new_df = pd.DataFrame()
for path in paths.values():
    s = pd.Series(path)
    new_df = pd.concat([new_df, s], axis=1)

# Transpose and cleanup
new_df = (
    new_df
    .T
    .pipe(lambda x: x.iloc[x.isna().sum(axis=1).mul(-1).argsort()])
    .fillna(999)
    .astype(int)
    .pipe(lambda x: x.set_axis([f"level_{i+1}" for i in x.columns], axis="columns"))
    .replace(999, "")
    .reset_index(drop=True)
)

And so:

print(new_df)
# Output
   level_1  level_2 level_3 level_4
0      111      220
1      111      222     224        
2      111      221     223     228

Upvotes: 1

Related Questions