Reputation: 65
I've got a tricky situation - tricky for me since I'm really new to python. I've got a dataframe in pandas and I need to logic my way through building a new column that will be used later in a data match from a difference source. Basically, the picture tells what I can't figure out.
For any of the LOW labels I need to retrieve their MID_LEVEL label and copy it to a new column. The DESIRED OUTPUT column is what I need to create.
You can see that the LABEL_PATH is formatted in a way that I can use the first 9 digits as a "lookup" to find the corresponding LABEL, but I can't figure out how to achieve that. As an example, for any row that the LABEL_PATH starts with "0.02.0004" the desired output needs to be "MID_LEVEL1".
This dataset has around 25k rows, so wanted to avoid row iteration as well.
Any help would be greatly appreciated!
Upvotes: 1
Views: 118
Reputation: 375
I am using this data for comparison :
test_dict = {"label_path": [1, 2, 3, 4, 5, 6], "label": ["low1", "low2", "mid1", "mid2", "high1", "high2"], "desired_output": ["mid1", "mid2", "mid1", "mid2", "high1", "high2"]}
df = pd.DataFrame(test_dict)
Which gives :
label_path label desired_output
0 1 low1 mid1
1 2 low2 mid2
2 3 mid1 mid1
3 4 mid2 mid2
4 5 high1 high1
5 6 high2 high2
With a bit ogf logic and a merge :
desired_label_df = df.drop_duplicates("desired_output", keep="last")
desired_label_df = desired_label_df[["label_path", "desired_output"]]
desired_label_df.columns = ["desired_label_path", "desired_output"]
df = df.merge(desired_label_df, on="desired_output", how="left")
Gives us :
label_path label desired_output desired_label_path
0 1 low1 mid1 3
1 2 low2 mid2 4
2 3 mid1 mid1 3
3 4 mid2 mid2 4
4 5 high1 high1 5
5 6 high2 high2 6
Edit: if you want to create the desired_output column, just do the following :
df["desired_output"] = df["label"].apply(lambda x: x.replace("low", "mid"))
Upvotes: 0
Reputation: 470
Chosing a similar example as you did:
df = pd.DataFrame({"a":["1","1.1","1.1.1","1.1.2","2"],"b":range(5)})
df["c"] = np.nan
mask = df.a.apply(lambda x: len(x.split(".")) < 3)
df.loc[mask,"c"] = df.b[mask]
df.c.fillna(method="ffill", inplace=True)
Most of the magic takes place in the line where mask
is defined, but it's not that difficult: if the value in a
gets split into less than 3 parts (i.e., has at most one dot), mark it as True
, otherwise not.
Use that mask to copy over the values, and then fill unspecified values with valid values from above.
Upvotes: 1