J. Roybomb
J. Roybomb

Reputation: 65

python3.7 & pandas - use column value in row as lookup value to return different column value

I've got a tricky situation - tricky for me since I'm really new to python. I've got a dataframe in pandas and I need to logic my way through building a new column that will be used later in a data match from a difference source. Basically, the picture tells what I can't figure out.

For any of the LOW labels I need to retrieve their MID_LEVEL label and copy it to a new column. The DESIRED OUTPUT column is what I need to create.

You can see that the LABEL_PATH is formatted in a way that I can use the first 9 digits as a "lookup" to find the corresponding LABEL, but I can't figure out how to achieve that. As an example, for any row that the LABEL_PATH starts with "0.02.0004" the desired output needs to be "MID_LEVEL1".

This dataset has around 25k rows, so wanted to avoid row iteration as well.

Any help would be greatly appreciated!

enter image description here

Upvotes: 1

Views: 118

Answers (2)

Odhian
Odhian

Reputation: 375

I am using this data for comparison :

test_dict = {"label_path": [1, 2, 3, 4, 5, 6], "label": ["low1", "low2", "mid1", "mid2", "high1", "high2"], "desired_output": ["mid1", "mid2", "mid1", "mid2", "high1", "high2"]}
df = pd.DataFrame(test_dict)

Which gives :

    label_path  label   desired_output
0   1           low1    mid1
1   2           low2    mid2
2   3           mid1    mid1
3   4           mid2    mid2
4   5           high1   high1
5   6           high2   high2

With a bit ogf logic and a merge :

desired_label_df = df.drop_duplicates("desired_output", keep="last")
desired_label_df = desired_label_df[["label_path", "desired_output"]]
desired_label_df.columns = ["desired_label_path", "desired_output"]
df = df.merge(desired_label_df, on="desired_output", how="left")

Gives us :

label_path  label   desired_output  desired_label_path
0   1       low1    mid1            3
1   2       low2    mid2            4
2   3       mid1    mid1            3
3   4       mid2    mid2            4
4   5       high1   high1           5
5   6       high2   high2           6

Edit: if you want to create the desired_output column, just do the following :

df["desired_output"] = df["label"].apply(lambda x: x.replace("low", "mid"))

Upvotes: 0

Marius Wallraff
Marius Wallraff

Reputation: 470

Chosing a similar example as you did:

df = pd.DataFrame({"a":["1","1.1","1.1.1","1.1.2","2"],"b":range(5)})

df["c"] = np.nan
mask = df.a.apply(lambda x: len(x.split(".")) < 3)
df.loc[mask,"c"] = df.b[mask]
df.c.fillna(method="ffill", inplace=True)

Most of the magic takes place in the line where mask is defined, but it's not that difficult: if the value in a gets split into less than 3 parts (i.e., has at most one dot), mark it as True, otherwise not.

Use that mask to copy over the values, and then fill unspecified values with valid values from above.

Upvotes: 1

Related Questions