zbinsd

Reputation: 4214

Python Pandas - Remove values from first dataframe if not in second dataframe

I have user/item data for a recommender. I'm splitting it into train and test sets, and I need to make sure that any users or items that appear only in the test data are removed before evaluating the recommender. My approach works for small datasets, but on large ones it takes forever. Is there a better way to do this?

import pandas as pd

# Test set for removing users or items not in train
te = pd.DataFrame({'user': [1,2,3,1,6,1], 'item':[16,12,19,15,13,12]})
tr = pd.DataFrame({'user': [1,2,3,4,5], 'item':[11,12,13,14,15]})
print("Training_______")
print(tr)
print("\nTesting_______")
print(te)

# By using two joins and selecting the proper indices, all 'new' members of test set are removed
b = pd.merge(pd.merge(te, tr, on='user', suffixes=['', '_d']), tr, on='item', suffixes=['', '_d'])[['user', 'item']]
print("\nSolution_______")
print(b)

Gives:

Training_______
   item  user
0    11     1
1    12     2
2    13     3
3    14     4
4    15     5

Testing_______
   item  user
0    16     1
1    12     2
2    19     3
3    15     1
4    13     6
5    12     1

Solution_______
   user  item
0     1    15
1     1    12
2     2    12

The solution is correct (any new user or item causes the whole row to be removed from test), but it is slow at scale.

Thanks in advance.

Upvotes: 2

Views: 1346

Answers (1)

Andy Hayden

Reputation: 375485

I think you can achieve what you want using the isin Series method on each of the columns:

In [11]: te['item'].isin(tr['item']) & te['user'].isin(tr['user'])
Out[11]:
0    False
1     True
2    False
3     True
4    False
5     True
dtype: bool

In [12]: te[te['item'].isin(tr['item']) & te['user'].isin(tr['user'])]
Out[12]:
   item  user
1    12     2
3    15     1
5    12     1
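
As a self-contained sketch of the same idea, the mask can be built column by column inside a small helper. The function name drop_unseen and its cols parameter are made up for illustration and are not part of pandas; the only library call relied on is Series.isin.

```python
import pandas as pd


def drop_unseen(test, train, cols=("user", "item")):
    """Keep the test rows whose value in every column of `cols` also occurs in train.

    Hypothetical helper for illustration; not a pandas function.
    """
    # Start with every row allowed, then narrow the mask one column at a time.
    mask = pd.Series(True, index=test.index)
    for col in cols:
        mask &= test[col].isin(train[col])
    return test[mask]


te = pd.DataFrame({'user': [1, 2, 3, 1, 6, 1], 'item': [16, 12, 19, 15, 13, 12]})
tr = pd.DataFrame({'user': [1, 2, 3, 4, 5], 'item': [11, 12, 13, 14, 15]})

filtered = drop_unseen(te, tr)
print(filtered)
```

Unlike the double merge, this never joins the two frames, so repeated users or items in train cannot multiply the number of rows in the result, and the original index of te is preserved.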

In 0.13 you'll be able to use the new DataFrame isin method (on current master):

In [21]: te[te.isin(tr.to_dict(outtype='list')).all(1)]
Out[21]:
   item  user
1    12     2
3    15     1
5    12     1

Hopefully the syntax will be a bit cleaner by the time of the release:

te[te.isin(tr).all(1)]
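
For readers on a recent pandas, the following sketch may help. It assumes a version where to_dict takes orient rather than outtype. As far as the documentation describes it, passing a DataFrame to isin matches a cell only when both its index label and its column label line up with the other frame, so the shorter form is not a drop-in replacement for the dict form on this data.

```python
import pandas as pd

te = pd.DataFrame({'user': [1, 2, 3, 1, 6, 1], 'item': [16, 12, 19, 15, 13, 12]})
tr = pd.DataFrame({'user': [1, 2, 3, 4, 5], 'item': [11, 12, 13, 14, 15]})

# Dict form: each column of te is checked against the list of values
# stored under the same column name, regardless of row position.
allowed = tr.to_dict(orient='list')
by_values = te[te.isin(allowed).all(axis=1)]
print(by_values)

# DataFrame form: a cell counts as a match only if tr holds the same value
# at the same index label and column label, so this compares row by row.
by_position = te[te.isin(tr).all(axis=1)]
print(by_position)
```

On this data the dict form keeps rows 1, 3 and 5, while the DataFrame form keeps only row 1, the single row where te and tr happen to agree at the same position.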

Upvotes: 5
