Tomas

Reputation: 305

Python/Pandas - Replacing an element in one dataframe with a value from another dataframe

I have an issue with replacing an element in one pandas DataFrame with a value from another pandas DataFrame. Apologies for the long post; I have tried to give many intermediate examples to clarify my problem. I use Python 2.7.11 (Anaconda 4.0.0, 64-bit).

Data

I have a pandas DataFrame containing many user item pairs. This DataFrame (let's call it the initial_user_item_matrix) is of the form:

   userId itemId  interaction
1       1      1            1
2       1      2            0
3       1      3            1
4       1      4            1
5       2      9            1
6       3      3            1
7       3      5            0

Furthermore, I have a DataFrame containing only the user item pairs of user 1. I call this the cold_user_item_matrix. This DataFrame is of the form:

   userId itemId  interaction
1       1      1            1
2       1      2            0
3       1      3            1
4       1      4            1

Next, I have a numpy ndarray with items, which I call the ranked_items. It is of the form:

[9 5 3 4]

Finally, I change the interactions of user 1 in the initial_user_item_matrix to NaN's which gives the following DataFrame (call it new_user_item_matrix):

   userId itemId  interaction
1       1      1          NaN
2       1      2          NaN
3       1      3          NaN
4       1      4          NaN
5       2      9            1
6       3      3            1
7       3      5            0

What do I want to achieve?

I want to change the interaction of the user 1 - item pairs in the new_user_item_matrix (currently NaN's) to the value of that particular interaction in the initial_user_item_matrix IF AND ONLY IF the item is contained in the ranked_items array. Afterwards, all user item pairs (rows of the DataFrame) where the interaction is still NaN should be removed (user 1 - item pairs for which the itemId is not in ranked_items). See below what the result set should look like.

Intermediate result:

   userId itemId  interaction
1       1      1          NaN
2       1      2          NaN
3       1      3            1
4       1      4            1
5       2      9            1
6       3      3            1
7       3      5            0

Final result:

   userId itemId  interaction
3       1      3            1
4       1      4            1
5       2      9            1
6       3      3            1
7       3      5            0

What have I tried?

This is my code:

for item in ranked_items:
    if new_user_item_matrix.loc[new_user_item_matrix['userId']==cold_user].loc[new_user_item_matrix['itemId']==item].empty:
        pass
    else:
        new_user_item_matrix.replace(
            to_replace=new_user_item_matrix.loc[new_user_item_matrix['userId']==1].loc[new_user_item_matrix['itemId']==item].iloc[0,2],
            value=cold_user_item_matrix.loc[cold_user_item_matrix['itemId']==item].iloc[0,2],
            inplace=True)

new_user_item_matrix.dropna(axis=0,how='any',inplace=True)

What does it do? It loops over all items in the ranked_items array. First, it checks whether user 1 has interacted with the item (the if-part of the if statement). If not, then go to the next item in the ranked_items array (pass). If user 1 did interact with the item (the else-part of the if statement), replace the interaction of user 1 with the item from the new_user_item_matrix (currently a NaN) by the value of the interaction of user 1 with the item from the cold_user_item_matrix, which is either a 1 or a 0 (I hope you are all still with me).

What is going wrong?

The if-part of the if statement does not give any problems. It goes wrong when I try to replace the value in the new_user_item_matrix (the else-part of the if statement). The replacement does not only change that particular element (the interaction), but also ALL other values that are NaN in the new_user_item_matrix. To illustrate: when the loop starts, it first loops over itemIds 9 and 5, which user 1 has not interacted with (hence nothing happens). Next, it reaches itemId 3, and the interaction for userId 1 and itemId 3 should change from NaN to 1. But it does not only change the interaction for userId 1 and itemId 3 to 1; it also changes all other interactions of user 1 that are NaN, giving the following result set:

   userId itemId  interaction
1       1      1            1
2       1      2            1
3       1      3            1
4       1      4            1
5       2      9            1
6       3      3            1
7       3      5            0

Which is obviously incorrect, as itemIds 1 and 2 are not in the ranked_items array and hence their true interaction should not be uncovered. Also, the interaction for user 1 and itemId 3 (a 1) is filled in for all of these interactions, even where the true interaction is not a 1 but a 0.

Anybody that can help me out here?

Upvotes: 2

Views: 1406

Answers (1)

Michael Hoff

Reputation: 6318

Short solution

In essence, you want to throw away all item interactions for a given user, but only for items which are not ranked.
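
As for why the loop in the question fills in every NaN at once: DataFrame.replace matches cells by value, not by position, so passing a NaN as to_replace touches every NaN in the frame. The sketch below illustrates this on a tiny made-up frame (the variable names replaced and targeted are only for illustration) and contrasts it with a positional assignment through .loc.

```python
import numpy as np
import pandas as pd

# Tiny illustrative frame: three hidden (NaN) interactions.
df = pd.DataFrame({"itemID": [1, 2, 3],
                   "interaction": [np.nan, np.nan, np.nan]})

# replace() looks for the *value* NaN anywhere in the frame,
# so all three rows are overwritten.
replaced = df.replace(to_replace=np.nan, value=1.0)

# A boolean mask with .loc addresses one specific cell instead.
targeted = df.copy()
targeted.loc[targeted["itemID"] == 3, "interaction"] = 1.0

print(replaced)
print(targeted)
```

Only the .loc variant leaves the interactions for items 1 and 2 hidden.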

To make the proposed solutions more readable, assume df = initial_user_item_matrix, with the columns named userID and itemID as in the step-by-step example further down.

Simple row selection with boolean conditions (returns a new DataFrame and leaves the original df untouched):

filtered_df = df[(df.userID != 1) | df.itemID.isin(ranked_items)]

Similar solution modifying the dataframe in-place by dropping "invalid" rows:

df.drop(df[(df.userID == 1) & ~df.itemID.isin(ranked_items)].index, inplace=True)
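
For reference, here is a self-contained sketch of both one-liners on the sample data from the question. It assumes the columns are named userID and itemID, and it runs the in-place variant on a copy (dropped_df, a name introduced here only for illustration) so the two results can be compared.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 1, 1], [1, 2, 0], [1, 3, 1], [1, 4, 1],
                   [2, 9, 1], [3, 3, 1], [3, 5, 0]],
                  columns=["userID", "itemID", "interaction"])
ranked_items = np.array([9, 5, 3, 4])

# Variant 1: keep rows of other users, plus user 1's rows for ranked items.
filtered_df = df[(df.userID != 1) | df.itemID.isin(ranked_items)]

# Variant 2: drop user 1's rows for unranked items, in place (on a copy here).
dropped_df = df.copy()
unranked = dropped_df[(dropped_df.userID == 1)
                      & ~dropped_df.itemID.isin(ranked_items)].index
dropped_df.drop(unranked, inplace=True)

print(filtered_df)
print(filtered_df.equals(dropped_df))
```

Both variants keep the original row labels, so the surviving rows carry the index values 2 to 6.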

Step by step solution using all intermediate constructs

Assuming all above mentioned intermediate artifacts are required, the desired result can be obtained as follows:

import pandas as pd
import numpy as np

initial_user_item_matrix = pd.DataFrame([[1, 1, 1], 
                                        [1, 2, 0], 
                                        [1, 3, 1], 
                                        [1, 4, 1], 
                                        [2, 9, 1], 
                                        [3, 3, 1], 
                                        [3, 5, 0]],
                                        columns=['userID', 'itemID', 'interaction'])
print("initial_user_item_matrix\n{}\n".format(initial_user_item_matrix))

ranked_items = np.array([9, 5, 3, 4]) 

cold_user = 1 

cold_user_item_matrix = initial_user_item_matrix.loc[initial_user_item_matrix.userID == cold_user]
print("cold_user_item_matrix\n{}\n".format(cold_user_item_matrix))

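new_user_item_matrix = initial_user_item_matrix.copy()
# Cast to float first so the column can hold NaN (newer pandas versions warn when an int column is implicitly upcast).
new_user_item_matrix['interaction'] = new_user_item_matrix['interaction'].astype(float)
# Hide all interactions of the cold user.
new_user_item_matrix.loc[new_user_item_matrix.userID == cold_user, 'interaction'] = np.nan
print("new_user_item_matrix\n{}\n".format(new_user_item_matrix))

# Restore the interaction only for ranked items; the assigned Series is aligned on the row index.
new_user_item_matrix.loc[new_user_item_matrix.userID == cold_user, 'interaction'] = cold_user_item_matrix.apply(lambda r: r.interaction if r.itemID in ranked_items else np.nan, axis=1)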
print("new_user_item_matrix after replacing\n{}\n".format(new_user_item_matrix))

new_user_item_matrix.dropna(inplace=True)
print("new_user_item_matrix after dropping nans\n{}\n".format(new_user_item_matrix))

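produces (depending on the pandas version, the interaction column may print as 1.0/0.0 rather than 1/0 once it contains NaN)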

initial_user_item_matrix
   userID  itemID  interaction
0       1       1            1
1       1       2            0
2       1       3            1
3       1       4            1
4       2       9            1
5       3       3            1
6       3       5            0

cold_user_item_matrix
   userID  itemID  interaction
0       1       1            1
1       1       2            0
2       1       3            1
3       1       4            1

new_user_item_matrix
   userID  itemID  interaction
0       1       1          NaN
1       1       2          NaN
2       1       3          NaN
3       1       4          NaN
4       2       9            1
5       3       3            1
6       3       5            0

new_user_item_matrix after replacing
   userID  itemID  interaction
0       1       1          NaN
1       1       2          NaN
2       1       3            1
3       1       4            1
4       2       9            1
5       3       3            1
6       3       5            0

new_user_item_matrix after dropping nans
   userID  itemID  interaction
2       1       3            1
3       1       4            1
4       2       9            1
5       3       3            1
6       3       5            0
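
If the row-wise apply becomes slow on a larger frame, the same masking can probably be expressed in a vectorized way. The following is a minimal sketch, not a drop-in replacement for the steps above: it assumes the same column names, and the helper name hidden is introduced here only for illustration. It relies on Series.where, which keeps values where the condition is True and inserts NaN elsewhere.

```python
import numpy as np
import pandas as pd

initial_user_item_matrix = pd.DataFrame(
    [[1, 1, 1], [1, 2, 0], [1, 3, 1], [1, 4, 1],
     [2, 9, 1], [3, 3, 1], [3, 5, 0]],
    columns=["userID", "itemID", "interaction"])
ranked_items = np.array([9, 5, 3, 4])
cold_user = 1

# Rows whose interaction must stay hidden: cold user AND item not ranked.
hidden = ((initial_user_item_matrix["userID"] == cold_user)
          & ~initial_user_item_matrix["itemID"].isin(ranked_items))

new_user_item_matrix = initial_user_item_matrix.copy()
# where() keeps the value when the condition holds and writes NaN otherwise;
# the column becomes float because it now contains NaN.
new_user_item_matrix["interaction"] = (
    new_user_item_matrix["interaction"].where(~hidden))

# Drop the rows that are still hidden.
result = new_user_item_matrix.dropna(subset=["interaction"])
print(result)
```

This skips the intermediate step of blanking all of user 1's interactions and hides only the unranked ones directly, which should lead to the same final rows.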

Upvotes: 4
