Reputation: 4177
I have a DataFrame like this:
message_id reply_to_id sender
0 1 0 Roozbeh
1 2 1 Amir
2 3 0 Neda
3 4 2 Roozbeh
3 5 2 Neda
If a message is a reply to another message, reply_to_id holds the id of the message it replies to; otherwise it is 0. Now I want to create another column, reply_to_sender, that shows the name of the sender of the message being replied to (if the message isn't a reply, it can show NaN).
The message_id column is unique, but the reply_to_id and sender columns are obviously not.
I tried this:
data["reply_to_sender"] = data.loc[data["reply_to_id"] == data["message_id"]]["sender"]
But it obviously won't work, because it compares reply_to_id with message_id within the same row. What I'm trying to do is look at each row and then find the sender's name in another row. For the example above, the output needs to look like this:
message_id reply_to_id sender reply_to_sender
0 1 0 Roozbeh NaN
1 2 1 Amir Roozbeh
2 3 0 Neda NaN
3 4 2 Roozbeh Amir
3 5 2 Neda Amir
How can I do that?
Upvotes: 4
Views: 85
Reputation: 862451
Use Series.map with a lookup Series whose index is message_id and whose values are sender:
df['reply_to_sender'] = df['reply_to_id'].map(df.set_index('message_id')['sender'])
print(df)
message_id reply_to_id sender reply_to_sender
0 1 0 Roozbeh NaN
1 2 1 Amir Roozbeh
2 3 0 Neda NaN
3 4 2 Roozbeh Amir
3 5 2 Neda Amir
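For anyone who wants to run this end to end, here is a minimal self-contained sketch. It assumes the sample data from the question with a default integer index; the name sender_by_id is only illustrative.

```python
import pandas as pd

# Sample data from the question (the default 0..4 index is an assumption).
df = pd.DataFrame({
    "message_id": [1, 2, 3, 4, 5],
    "reply_to_id": [0, 1, 0, 2, 2],
    "sender": ["Roozbeh", "Amir", "Neda", "Roozbeh", "Neda"],
})

# Lookup Series: the index is message_id, the values are sender names.
sender_by_id = df.set_index("message_id")["sender"]

# map() looks up each reply_to_id in that index;
# ids with no match (here 0) become NaN.
df["reply_to_sender"] = df["reply_to_id"].map(sender_by_id)

print(df)
```

The lookup relies on message_id being unique, which the question states is the case.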
Upvotes: 6
Reputation: 10476
You can do
mymap = dict(zip(df.message_id, df.sender))  # message_id -> sender
and then
df['reply_to_sender'] = df.reply_to_id.map(mymap)
This gives you
message_id reply_to_id sender reply_to_sender
0 1 0 Roozbeh NaN
1 2 1 Amir Roozbeh
2 3 0 Neda NaN
3 4 2 Roozbeh Amir
3 5 2 Neda Amir
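As a rough sketch of what the dictionary route looks like on the sample data: the DataFrame below is rebuilt from the question, and building the dictionary with dict(zip(...)) is a choice made for this example. Printing mymap shows which ids have a sender to look up.

```python
import pandas as pd

# Sample data from the question (the default integer index is an assumption).
df = pd.DataFrame({
    "message_id": [1, 2, 3, 4, 5],
    "reply_to_id": [0, 1, 0, 2, 2],
    "sender": ["Roozbeh", "Amir", "Neda", "Roozbeh", "Neda"],
})

# Plain dict pairing every message_id with the sender on the same row.
mymap = dict(zip(df["message_id"], df["sender"]))
print(mymap)

# Keys missing from the dict (reply_to_id == 0) come out as NaN.
df["reply_to_sender"] = df["reply_to_id"].map(mymap)
print(df)
```

Since 0 is never a message_id, it is not a key in the dictionary, which is why non-replies end up as NaN without any special handling.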
Upvotes: 0
Reputation: 4177
First, let's see how you would do this yourself, by hand. Then we implement it in code.
If I give you a reply_to_id, you can tell me who this message was a reply to by simply looking at the DataFrame, finding the row whose message_id is equal to that number, and then telling me the value in that row's sender column. This can be done like this, where the reply_to_id variable is the number I gave you:
data.loc[data["message_id"] == reply_to_id]["sender"]
This code returns a pandas.Series, but we didn't ask for a Series; we asked for a scalar value, the name of the sender. So we need to extract that value from the Series. If there's exactly one value in the Series (you need to check that), we can extract it with .values[0]. So the code becomes:
reply_to_sender_values = data.loc[data["message_id"] == reply_to_id]["sender"].values
if len(reply_to_sender_values) == 1:
    return reply_to_sender_values[0]
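To make the Series-versus-scalar point concrete, here is a small sketch you can run on its own. It assumes the sample data from the question and a hypothetical reply_to_id of 2; the names matches and values are only illustrative.

```python
import pandas as pd

# Sample data from the question (the default integer index is an assumption).
data = pd.DataFrame({
    "message_id": [1, 2, 3, 4, 5],
    "reply_to_id": [0, 1, 0, 2, 2],
    "sender": ["Roozbeh", "Amir", "Neda", "Roozbeh", "Neda"],
})

reply_to_id = 2  # hypothetical id to look up

# The boolean selection yields a Series, even when only one row matches.
matches = data.loc[data["message_id"] == reply_to_id]["sender"]
print(type(matches).__name__)

# .values is a NumPy array; take element 0 only after checking the length.
values = matches.values
if len(values) == 1:
    print(values[0])
```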
Now what would happen if I gave you a number that you didn't find in message_id? What would you do? You'd tell me that you found nothing. That translates to this:
reply_to_sender_values = data.loc[data["message_id"] == reply_to_id]["sender"].values
if len(reply_to_sender_values) == 1:
    return reply_to_sender_values[0]
else:
    return ""
There's one more thing we need to pay attention to. As you said, the values in reply_to_id can be zero. So we need to take care of that:
if reply_to_id != 0:
    reply_to_sender_values = data.loc[data["message_id"] == reply_to_id]["sender"].values
    if len(reply_to_sender_values) == 1:
        return reply_to_sender_values[0]
    else:
        return ""
else:
    return ""
As you can see, we've just built a function to do what you would do by hand. Let's give it a name:
def reply_to_sender(reply_to_id):
    if reply_to_id != 0:
        reply_to_sender_values = data.loc[data["message_id"] == reply_to_id]["sender"].values
        if len(reply_to_sender_values) == 1:
            return reply_to_sender_values[0]
        else:
            return ""
    else:
        return ""
All that's left is to apply this function to every value in the reply_to_id column of our DataFrame. Luckily, pandas has a method that does just that. It's called, you guessed it, apply; since we call it on a single column, it's pandas.Series.apply. Now it all comes together with this line of code:
data["reply_to_sender"] = data["reply_to_id"].apply(lambda x: reply_to_sender(x))
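One thing to note: I tested this code in a Jupyter Notebook, where data is a global variable that the function can see. If data is not visible where reply_to_sender is defined (for example, the function lives in a separate module of your script), pass the DataFrame to the function explicitly. The code then changes to this: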
def reply_to_sender(data, reply_to_id):
    if reply_to_id != 0:
        reply_to_sender_values = data.loc[data["message_id"] == reply_to_id]["sender"].values
        if len(reply_to_sender_values) == 1:
            return reply_to_sender_values[0]
        else:
            return ""
    else:
        return ""

data["reply_to_sender"] = data["reply_to_id"].apply(lambda x: reply_to_sender(data, x))
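Putting it together, here is a compact runnable sketch of the same idea. The sample DataFrame is rebuilt from the question (the default integer index is an assumption), and the function body is condensed slightly but follows the same steps as above.

```python
import pandas as pd

def reply_to_sender(data, reply_to_id):
    """Return the sender of the message whose id is reply_to_id, or "" if none."""
    if reply_to_id == 0:
        return ""  # 0 means the message is not a reply
    values = data.loc[data["message_id"] == reply_to_id, "sender"].values
    # Only trust the result when exactly one row matched.
    return values[0] if len(values) == 1 else ""

# Sample data from the question.
data = pd.DataFrame({
    "message_id": [1, 2, 3, 4, 5],
    "reply_to_id": [0, 1, 0, 2, 2],
    "sender": ["Roozbeh", "Amir", "Neda", "Roozbeh", "Neda"],
})

data["reply_to_sender"] = data["reply_to_id"].apply(lambda x: reply_to_sender(data, x))
print(data)
```

Note that this version fills non-replies with an empty string rather than NaN, and it scans the whole DataFrame once per row, so it will likely be slower than a map-based lookup on large data.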
Upvotes: 0