Reputation: 4214
I have user/item data for a recommender. I'm splitting it into test and train data, and I need to be sure that any new users or items in the test data are omitted before evaluating the recommender. My approach works for small datasets, but when the data gets big, it takes forever. Is there a better way to do this?
import pandas as pd

# Test set for removing users or items not in train
te = pd.DataFrame({'user': [1,2,3,1,6,1], 'item':[16,12,19,15,13,12]})
tr = pd.DataFrame({'user': [1,2,3,4,5], 'item':[11,12,13,14,15]})
print("Training_______")
print(tr)
print("\nTesting_______")
print(te)
# By using two joins and selecting the proper columns, all 'new' members of the test set are removed
b = pd.merge(pd.merge(te, tr, on='user', suffixes=['', '_d']), tr, on='item', suffixes=['', '_d'])[['user', 'item']]
print("\nSolution_______")
print(b)
Gives:
Training_______
item user
0 11 1
1 12 2
2 13 3
3 14 4
4 15 5
Testing_______
item user
0 16 1
1 12 2
2 19 3
3 15 1
4 13 6
5 12 1
Solution_______
user item
0 1 15
1 1 12
2 2 12
The solution is correct (any new user or item causes the whole row to be removed from test), but it is slow at scale.
Thanks in advance.
Upvotes: 2
Views: 1346
Reputation: 375485
I think you can achieve what you want using the Series isin method on each of the columns:
In [11]: te['item'].isin(tr['item']) & te['user'].isin(tr['user'])
Out[11]:
0 False
1 True
2 False
3 True
4 False
5 True
dtype: bool
In [12]: te[te['item'].isin(tr['item']) & te['user'].isin(tr['user'])]
Out[12]:
item user
1 12 2
3 15 1
5 12 1
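If you need this in more than one place, the mask above can be wrapped in a small helper. This is only a sketch: the name filter_unseen and its cols parameter are made up for illustration and are not part of pandas.

```python
import pandas as pd

def filter_unseen(test, train, cols=("user", "item")):
    """Keep only the test rows whose value in every column of `cols` also occurs in train.

    Hypothetical helper, not a pandas function.
    """
    # Start with an all-True mask and narrow it one column at a time.
    mask = pd.Series(True, index=test.index)
    for col in cols:
        mask = mask & test[col].isin(train[col])
    return test[mask]

te = pd.DataFrame({'user': [1, 2, 3, 1, 6, 1], 'item': [16, 12, 19, 15, 13, 12]})
tr = pd.DataFrame({'user': [1, 2, 3, 4, 5], 'item': [11, 12, 13, 14, 15]})

filtered = filter_unseen(te, tr)
print(filtered)
```

Unlike the double merge, this keeps the original index of te and never builds the intermediate joined frames, which is where I'd expect most of the time to go on large data.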
In 0.13 you'll be able to use the new DataFrame isin method (on current master):
In [21]: te[te.isin(tr.to_dict(outtype='list')).all(1)]
Out[21]:
item user
1 12 2
3 15 1
5 12 1
Hopefully, by release, the syntax will be a bit nicer:
te[te.isin(tr).all(1)]
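For anyone on a later pandas, here is a sketch of the dict form, assuming a version where the to_dict keyword is spelled orient rather than outtype. As far as I know, in released versions passing a DataFrame to isin also matches on index labels, so it is the dict form that reproduces the filter from the question; the variable names below are just for illustration.

```python
import pandas as pd

te = pd.DataFrame({'user': [1, 2, 3, 1, 6, 1], 'item': [16, 12, 19, 15, 13, 12]})
tr = pd.DataFrame({'user': [1, 2, 3, 4, 5], 'item': [11, 12, 13, 14, 15]})

# Dict of column name -> list of values seen in the training frame.
# Assumes a pandas version where the keyword is `orient`.
allowed = tr.to_dict(orient='list')

# With a dict, each column of te is checked against its own list of values.
by_dict = te[te.isin(allowed).all(axis=1)]

# With a DataFrame, values must also sit at the same index label,
# so this is a stricter test and keeps fewer rows.
by_frame = te[te.isin(tr).all(axis=1)]

print(by_dict)
print(by_frame)
```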
Upvotes: 5