Reputation: 179
I have following dataframe:
dct ={'store':('A','A','A','A','A','B','B','B','C','C','C'),
'station':('aisle','aisle','aisle','window','window','aisle','aisle','aisle','aisle','window','window'),
'produce':('apple','apple','orange','orange','orange','apple','apple','orange','apple','apple','orange')}
df = pd.DataFrame(dct)
print(df)
store station produce
A aisle apple
A aisle apple
A aisle orange
A window orange
A window orange
B aisle apple
B aisle apple
B aisle orange
C aisle apple
C window apple
C window orange
subset df based on: [Count of duplicate data based on store, station, and produce] is not same as [total count based on store, station, & produce]. In other words, if any store has only duplicate rows based on store, station, and produce then remove it however include rows even if one non-duplicate record found:
Expected dataframe walkthrough
store station produce
A aisle apple
A aisle apple
A aisle orange
A window orange ->exclude because store, station and produce match
A window orange ->exclude because store, station and produce match
B aisle apple
B aisle apple
B aisle orange
C aisle apple
C window apple
C window orange
expected dataframe:
store station produce
A aisle apple
A aisle apple
A aisle orange
B aisle apple
B aisle apple
B aisle orange
C aisle apple
C window apple
C window orange
From Store 'B' apple is included because 'orange' is also present for same store-station makes it exception. Conceptually, I understand what to do, but unable to translate it in the code.
s = (df.duplicated(subset = ['store','station','produce'], keep=False))
sample = df[df.groupby(['store','station'])['station_ID'].sum().eq(dupli_count)] --> something going wrong here
Upvotes: 3
Views: 47
Reputation: 323326
We can try groupby
with transform
nunique
df = df[df.groupby(['store', 'station'])['produce'].transform('nunique')!=1]
Out[43]:
store station produce
0 A aisle apple
1 A aisle apple
2 A aisle orange
5 B aisle apple
6 B aisle apple
7 B aisle orange
9 C window apple
10 C window orange
Update if we would like keep the one row only group
g = df.groupby(['store', 'station'])['produce']
df = df[(g.transform('nunique')!=1) | (g.transform('count')==1)]
df
Out[46]:
store station produce
0 A aisle apple
1 A aisle apple
2 A aisle orange
5 B aisle apple
6 B aisle apple
7 B aisle orange
8 C aisle apple
9 C window apple
10 C window orange
Upvotes: 2