Comparing 2 dataframe columns to 2 numpy array values in the same row

Question

I have a dataframe with the columns df['article_id'] and df['user_id']. I also have a numpy array (or a list; I figured a numpy array would be faster for this) that contains article_id and user_id pairs. The goal is to compare the dataframe with the array so I can filter out duplicate entries. A row counts as a duplicate only if both its user_id and its article_id match. So the idea is:

if (df['article_id'] == nparray[:,0]) & (df['user_id'] == nparray[:,1]):
    remove the row from the dataframe

Here is what the array and the dataframe look like (for now there is only one user_id, but there will be more later). If the array contains the same (article_id, user_id) pair as a dataframe row, that row should be deleted:

array([[1127087222,          1],
       [1202623831,          1],
       [1747352473,          1],
       [1748645480,          1],
       [1759957596,          1],
       [1811054956,          1]])

   user_id  article_id           date_saved
0        1  2579244390  2019-05-09 10:46:23
1        1  2580336884  2019-05-09 10:46:22
2        1  1202623831  2019-05-09 10:46:20
3        1  2450784233  2019-01-11 12:36:44
4        1  1747352473  2019-01-03 21:38:34

Desired output:

   user_id  article_id           date_saved
0        1  2579244390  2019-05-09 10:46:23
1        1  2580336884  2019-05-09 10:46:22
3        1  2450784233  2019-01-11 12:36:44

How can I achieve this?

Andy L. · Accepted Answer

After your clarification, you can get the desired output using np.isin and the negation operator '~' as follows:

df[~np.isin(df[['user_id', 'article_id']], nparray)]

Out[17]:
   user_id  article_id           date_saved
0        1  2579244390  2019-05-09 10:46:23
1        1  2580336884  2019-05-09 10:46:22
3        1  2450784233  2019-01-11 12:36:44
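
Note that np.isin tests each element on its own against all values in the array, not each (article_id, user_id) pair as a unit. Once there is more than one user_id, a pair-wise check is probably safer. Below is a minimal sketch of one way to do that with pandas MultiIndex objects, rebuilt from the sample data in the question. The names seen, pairs, and filtered are only illustrative, and date_saved is assumed to be stored as plain strings.

```python
import numpy as np
import pandas as pd

# Sample data from the question; array columns are (article_id, user_id).
nparray = np.array([
    [1127087222, 1],
    [1202623831, 1],
    [1747352473, 1],
    [1748645480, 1],
    [1759957596, 1],
    [1811054956, 1],
])

df = pd.DataFrame({
    "user_id": [1, 1, 1, 1, 1],
    "article_id": [2579244390, 2580336884, 1202623831, 2450784233, 1747352473],
    # Assumption: dates are kept as strings here for simplicity.
    "date_saved": [
        "2019-05-09 10:46:23",
        "2019-05-09 10:46:22",
        "2019-05-09 10:46:20",
        "2019-01-11 12:36:44",
        "2019-01-03 21:38:34",
    ],
})

# Pairs that are already known, in (article_id, user_id) order.
seen = pd.MultiIndex.from_arrays([nparray[:, 0], nparray[:, 1]])

# The same pairs taken from the dataframe, in the same column order.
pairs = pd.MultiIndex.from_frame(df[["article_id", "user_id"]])

# Keep only rows whose (article_id, user_id) pair is not in the array.
filtered = df[~pairs.isin(seen)]
print(filtered)
```

Because whole pairs are compared, a row with a known article_id but a different user_id would be kept, which matches the "both need to be the same" requirement from the question.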
