Amir Shabani

Reputation: 4177

Pandas - Create new column where values are taken from other rows in the same dataframe

I have a DataFrame like this:

    message_id  reply_to_id     sender
0   1           0               Roozbeh
1   2           1               Amir
2   3           0               Neda
3   4           2               Roozbeh
4   5           2               Neda

If a message is a reply to another message, reply_to_id holds the message_id of the message it replies to; otherwise it's 0. Now I want to create another column, reply_to_sender, that shows the name of the sender of the message being replied to (and if the message isn't a reply, it can show NaN).

The message_id column is unique, but reply_to_id and sender columns are obviously not.
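For anyone who wants to reproduce this, the frame above could be built roughly as follows. This is only a sketch: the variable name data matches my attempt, and the default 0..4 index is an assumption.

```python
import pandas as pd

# Sample data from the table above; message_id is unique,
# reply_to_id is 0 when the message is not a reply.
data = pd.DataFrame({
    "message_id": [1, 2, 3, 4, 5],
    "reply_to_id": [0, 1, 0, 2, 2],
    "sender": ["Roozbeh", "Amir", "Neda", "Roozbeh", "Neda"],
})
print(data)
```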

I tried this:

data["reply_to_sender"] = data.loc[data["reply_to_id"] == data["message_id"]]["sender"]

But it obviously won't work, because it looks at each row and performs the comparison within that row only. What I'm trying to do is to look at each row and then find the name of the sender from other rows. For the example above, the output needs to be like this:

    message_id  reply_to_id     sender    reply_to_sender
0   1           0               Roozbeh   NaN
1   2           1               Amir      Roozbeh
2   3           0               Neda      NaN
3   4           2               Roozbeh   Amir
4   5           2               Neda      Amir

How can I do that?

Upvotes: 4

Views: 85

Answers (3)

jezrael

Reputation: 862451

Use Series.map with a Series that has message_id as its index and sender as its values:

df['reply_to_sender'] = df['reply_to_id'].map(df.set_index('message_id')['sender'])
print(df)
   message_id  reply_to_id   sender reply_to_sender
0           1            0  Roozbeh             NaN
1           2            1     Amir         Roozbeh
2           3            0     Neda             NaN
3           4            2  Roozbeh            Amir
4           5            2     Neda            Amir
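To make the lookup step easier to follow, here is a self-contained sketch that splits the one-liner in two. The name lookup is introduced only for illustration, and the frame is rebuilt from the question's sample data. It relies on message_id being unique, as the question states.

```python
import pandas as pd

df = pd.DataFrame({
    "message_id": [1, 2, 3, 4, 5],
    "reply_to_id": [0, 1, 0, 2, 2],
    "sender": ["Roozbeh", "Amir", "Neda", "Roozbeh", "Neda"],
})

# Lookup table: the index is message_id, the values are sender names.
lookup = df.set_index("message_id")["sender"]

# Each reply_to_id is looked up in that index. 0 is not a message_id,
# so those rows get NaN.
df["reply_to_sender"] = df["reply_to_id"].map(lookup)
print(df)
```

Because the matching is done by index lookup rather than a Python-level loop, this usually scales well to large frames.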

Upvotes: 6

ignoring_gravity

Reputation: 10476

You can build a dictionary that maps each message_id to its sender:

mymap = {val: df.sender.loc[key] for key, val in df.message_id.to_dict().items()}

and then

df['reply_to_sender'] = df.reply_to_id.map(mymap)

This gives you

   message_id  reply_to_id   sender reply_to_sender
0           1            0  Roozbeh             NaN
1           2            1     Amir         Roozbeh
2           3            0     Neda             NaN
3           4            2  Roozbeh            Amir
4           5            2     Neda            Amir
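As a possible simplification, the same dictionary could be built by zipping the two columns, which avoids going through the index labels. This is a sketch on the question's sample data, not a claim that it is the preferred approach.

```python
import pandas as pd

df = pd.DataFrame({
    "message_id": [1, 2, 3, 4, 5],
    "reply_to_id": [0, 1, 0, 2, 2],
    "sender": ["Roozbeh", "Amir", "Neda", "Roozbeh", "Neda"],
})

# Pair each message_id with the sender on the same row.
mymap = dict(zip(df["message_id"], df["sender"]))

# Keys missing from the dict (here, 0) are mapped to NaN.
df["reply_to_sender"] = df["reply_to_id"].map(mymap)
print(df)
```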

Upvotes: 0

Amir Shabani

Reputation: 4177

First, let's see how you would do this yourself, by hand. Then we implement it in code.

If I give you a reply_to_id, you can tell me who this message was a reply to, by simply looking at the DataFrame, finding the row whose message_id is equal to that number, and then telling me the value in that row's sender column. This can be done like this, where the reply_to_id variable is the number I gave you:

data.loc[data["message_id"] == reply_to_id]["sender"]

This code returns a pandas.Series, but we didn't ask for a Series. We asked for a scalar value: the name of the sender. So we need to extract that value from the Series. If there's only one value in the Series (you need to check that), we can extract it using pandas.Series.values[0]. So the code becomes like this:

reply_to_sender_values = data.loc[data["message_id"] == reply_to_id]["sender"].values
if len(reply_to_sender_values) == 1:
    return reply_to_sender_values[0]

Now what would happen if I gave you a number that you didn't find in message_id? What would you do? You'd tell me that you found nothing. That translates to this:

reply_to_sender_values = data.loc[data["message_id"] == reply_to_id]["sender"].values
if len(reply_to_sender_values) == 1:
    return reply_to_sender_values[0]
else:
    return ""

There's one more thing we need to pay attention to. As you said, the values in reply_to_id can be zero. So we need to take care of that:

if(reply_to_id != 0):
    reply_to_sender_values = data.loc[data["message_id"] == reply_to_id]["sender"].values
    if len(reply_to_sender_values) == 1:
        return reply_to_sender_values[0]
    else:
        return ""
else:
    return ""

As you can see, we've just built a function to do what you would do by hand. Let's give it a name:

def reply_to_sender(reply_to_id):
    if(reply_to_id != 0):
        reply_to_sender_values = data.loc[data["message_id"] == reply_to_id]["sender"].values
        if len(reply_to_sender_values) == 1:
            return reply_to_sender_values[0]
        else:
            return ""
    else:
        return ""

All that's left to do is to find a way to apply this function to every value in the reply_to_id column of our DataFrame. Luckily, pandas has a method that does just that. And it's called, you guessed it, apply (here pandas.Series.apply, since a single column is a Series). Now it all comes together with this line of code:

data["reply_to_sender"] = data["reply_to_id"].apply(reply_to_sender)

One thing to note: I tested this code in a Jupyter Notebook, where the function picks up data as a global variable. If you don't want to rely on a global (for example, in a script or a module), pass the DataFrame to your reply_to_sender function. So the code changes to this:

def reply_to_sender(data, reply_to_id):
    # A reply_to_id of 0 means the message is not a reply.
    if reply_to_id == 0:
        return ""
    # Select the sender of the row whose message_id matches.
    reply_to_sender_values = data.loc[data["message_id"] == reply_to_id, "sender"].values
    # message_id is unique, so a valid id yields exactly one match.
    if len(reply_to_sender_values) == 1:
        return reply_to_sender_values[0]
    return ""

data["reply_to_sender"] = data["reply_to_id"].apply(lambda x: reply_to_sender(data, x))
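Note that this fills non-replies with an empty string, while the question asked for NaN. A minimal variant that returns NaN instead might look like the sketch below. The function name reply_to_sender_or_nan is made up for illustration, and the explicit zero check is dropped on the assumption that 0 never appears in message_id.

```python
import numpy as np
import pandas as pd

def reply_to_sender_or_nan(data, reply_to_id):
    # Senders of all rows whose message_id equals reply_to_id.
    matches = data.loc[data["message_id"] == reply_to_id, "sender"].values
    # Exactly one match: return the name; otherwise (no match) return NaN.
    return matches[0] if len(matches) == 1 else np.nan

data = pd.DataFrame({
    "message_id": [1, 2, 3, 4, 5],
    "reply_to_id": [0, 1, 0, 2, 2],
    "sender": ["Roozbeh", "Amir", "Neda", "Roozbeh", "Neda"],
})

data["reply_to_sender"] = data["reply_to_id"].apply(
    lambda x: reply_to_sender_or_nan(data, x)
)
print(data)
```

Keep in mind that this scans the whole DataFrame once per row, so on large frames it will likely be much slower than a single map call.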

Upvotes: 0
