Reputation: 607
I'm a bit stuck here trying to determine how to slice my dataframe.
data = {'Date' : ['08/20/10','08/20/10','08/20/10','08/21/10','08/22/10','08/24/10','08/25/10','08/26/10'] , 'Receipt' : [10001,10001,10002,10002,10003,10004,10004,10004],
'Product' : ['xx1','xx2','yy1','fff4','gggg4','fsf4','gggh5','hhhg6']}
dfTest = pd.DataFrame(data)
dfTest
This will produce:
Date Product Receipt
0 08/20/10 xx1 10001
1 08/20/10 xx2 10001
2 08/20/10 yy1 10002
3 08/21/10 fff4 10002
4 08/22/10 gggg4 10003
5 08/24/10 fsf4 10004
6 08/25/10 gggh5 10004
7 08/26/10 hhhg6 10004
I want to create a new dataframe that only contains unique receipts, meaning the receipt should only be used on 1 day only (but it can be shown multiple times in 1 day). If the receipt shows up in multiple days, it needs to be removed. The above data set should look like this:
Date Product Receipt
0 08/20/10 xx1 10001
1 08/20/10 xx2 10001
2 08/22/10 gggg4 10003
What I have done so far is:
dfTest.groupby(['Receipt','Date']).count()
Product
Receipt Date
10001 08/20/10 2
10002 08/20/10 1
08/21/10 1
10003 08/22/10 1
10004 08/24/10 1
08/25/10 1
08/26/10 1
I didn't know how to do a query for that date in that kind of structure, so I reset the index.
df1 = dfTest.groupby(['Receipt','Date']).count().reset_index()
Receipt Date Product
0 10001 08/20/10 2
1 10002 08/20/10 1
2 10002 08/21/10 1
3 10003 08/22/10 1
4 10004 08/24/10 1
5 10004 08/25/10 1
6 10004 08/26/10 1
Now I'm not sure how to proceed. I hope someone out there can lend a helping hand. This might be easy, I'm just a bit confused or lacking in experience.
Upvotes: 1
Views: 102
Reputation: 862681
You can use SeriesGroupBy.nunique
with boolean indexing where condition use Series.isin
:
df1 = dfTest.groupby(['Receipt'])['Date'].nunique()
print (df1)
Receipt
10001 1
10002 2
10003 1
10004 3
Name: Date, dtype: int64
#get indexes of all rows where length is 1
print (df1[df1 == 1].index)
Int64Index([10001, 10003], dtype='int64', name='Receipt')
#get all rows where in column Receipt are indexes with length 1
print (dfTest[dfTest['Receipt'].isin(df1[df1 == 1].index)])
Date Product Receipt
0 08/20/10 xx1 10001
1 08/20/10 xx2 10001
4 08/22/10 gggg4 10003
Another solution where find indexes by condition and then select DataFrame
by loc
:
print (dfTest.groupby(['Receipt']).filter(lambda x: x.Date.nunique()==1).index)
Int64Index([0, 1, 4], dtype='int64')
df1 = dfTest.loc[dfTest.groupby(['Receipt']).filter(lambda x: x.Date.nunique()==1).index]
print (df1)
Date Product Receipt
0 08/20/10 xx1 10001
1 08/20/10 xx2 10001
4 08/22/10 gggg4 10003
Upvotes: 1