Reputation: 516
If the name appears in any subsequent row, I want to drop that row. Mainly i'm not sure how to get the index of that found duplicate and then use that index number to drop it from df.
import pandas as pd
data = {'interviewer': ['Jason', 'Molly', 'Jermaine', 'Jake', 'Amy'],
'candidate': ['Bob', 'Jermaine', 'Ahmed', 'Karl', 'Molly'],
'year': [2012, 2012, 2013, 2014, 2014],
'reports': [4, 24, 31, 2, 3]}
df = pd.DataFrame(data)
#names = pd.unique(df[['interviewer', 'candidate']].values.ravel()).tolist()
mt = []
for i, c in zip(df.interviewer, df.candidate):
print i, c
if i not in mt:
if c not in mt:
mt.append(df.loc[(df.interviewer == i) & (df.candidate == c)] )
else:
continue
My thinking was use mt
as a list to pass to df.drop
and drop the rows with those indices. The result I want is without seeing Molly or Jermaine appear again in indices 2 or 4 - df.drop([2,4], inplace=True)
.
EDITED
I've figured out a way to create the list of indices I want to pass to drop:
import pandas as pd
data = {'interviewer': ['Jason', 'Molly', 'Jermaine', 'Jake', 'Amy'],
'candidate': ['Bob', 'Jermaine', 'Ahmed', 'Karl', 'Molly'],
'year': [2012, 2012, 2013, 2014, 2014],
'reports': [4, 24, 31, 2, 3]}
df = pd.DataFrame(data)
#print df
counter = -1
bad_rows = []
names = []
for i, c in zip(df.interviewer, df.candidate):
print i, c
counter += 1
print counter
if i not in names:
names.append(i)
else:
bad_rows.append(counter)
if c not in names:
names.append(c)
else:
bad_rows.append(counter)
#print df.drop(bad_rows)
However there has to be a smarter way to do this, maybe something along @Ami_Tavory answer for itertools?
Upvotes: 1
Views: 86
Reputation: 516
I made a function for what I want to do. Using df.index
makes it safe to use for any numerical index.
def drop_dup_rows(df):
names = []
for i, c, ind in zip(df.interviewer, df.candidate, df.index.tolist()):
if any(x in names for x in [i, c]):
df.drop(ind, inplace=True)
else:
names.extend([i,c])
return df
Upvotes: 0
Reputation: 76346
(At the time when this answer was written, there was some discrepancy between the verbal description and the code example.)
You can use isin
to check if an item appears in a different column, like so:
In [5]: df.candidate.isin(df.interviewer)
Out[5]:
0 False
1 True
2 False
3 False
4 True
Name: candidate, dtype: bool
Consequently, you can do something like
df[~df.candidate.isin(df.interviewer)]
Note that this matches your original code, not your specification of subsequent rows. If you want to drop only if the rows are subsequent, I'd go with itertools
, something like:
In [18]: bads = [i for ((i, cn), (j, iv)) in itertools.product(enumerate(df.candidate), enumerate(df.interviewer)) if j >=i and cn == iv]
In [19]: df[~df.index.isin(bads)]
Out[19]:
candidate interviewer reports year
0 Bob Jason 4 2012
2 Ahmed Jermaine 31 2013
3 Karl Jake 2 2014
4 Molly Amy 3 2014
Also, if you want to drop the subsequent rows, simply change things to
In [18]: bads = [j for ((i, cn), (j, iv)) in itertools.product(enumerate(df.candidate), enumerate(df.interviewer)) if j >=i and cn == iv]
Upvotes: 1