noblerthanoedipus
noblerthanoedipus

Reputation: 516

Find duplicates in mulitple columns and drop rows - Pandas

If the name appears in any subsequent row, I want to drop that row. Mainly i'm not sure how to get the index of that found duplicate and then use that index number to drop it from df.

import pandas as pd
data = {'interviewer': ['Jason', 'Molly', 'Jermaine', 'Jake', 'Amy'], 
        'candidate': ['Bob', 'Jermaine', 'Ahmed', 'Karl', 'Molly'],
        'year': [2012, 2012, 2013, 2014, 2014], 
        'reports': [4, 24, 31, 2, 3]}

df = pd.DataFrame(data)
#names = pd.unique(df[['interviewer', 'candidate']].values.ravel()).tolist()

mt = []

for i, c in zip(df.interviewer, df.candidate):
    print i, c
    if i not in mt:
        if c not in mt:
            mt.append(df.loc[(df.interviewer == i) & (df.candidate == c)] )
    else:
        continue

My thinking was use mt as a list to pass to df.drop and drop the rows with those indices. The result I want is without seeing Molly or Jermaine appear again in indices 2 or 4 - df.drop([2,4], inplace=True).

EDITED

I've figured out a way to create the list of indices I want to pass to drop:

import pandas as pd
data = {'interviewer': ['Jason', 'Molly', 'Jermaine', 'Jake', 'Amy'], 
        'candidate': ['Bob', 'Jermaine', 'Ahmed', 'Karl', 'Molly'],
        'year': [2012, 2012, 2013, 2014, 2014], 
        'reports': [4, 24, 31, 2, 3]}

df = pd.DataFrame(data)
#print df
counter = -1
bad_rows = []
names = []
for i, c in zip(df.interviewer, df.candidate):
    print i, c

    counter += 1
    print counter
    if i not in names:
        names.append(i)
    else:
        bad_rows.append(counter)
    if c not in names:
        names.append(c)
    else:
        bad_rows.append(counter)

#print df.drop(bad_rows)

However there has to be a smarter way to do this, maybe something along @Ami_Tavory answer for itertools?

Upvotes: 1

Views: 86

Answers (2)

noblerthanoedipus
noblerthanoedipus

Reputation: 516

I made a function for what I want to do. Using df.index makes it safe to use for any numerical index.

def drop_dup_rows(df):
    names = []
    for i, c, ind in zip(df.interviewer, df.candidate, df.index.tolist()):
        if any(x in names for x in [i, c]):
            df.drop(ind, inplace=True)
        else:
            names.extend([i,c])
    return df

Upvotes: 0

Ami Tavory
Ami Tavory

Reputation: 76346

(At the time when this answer was written, there was some discrepancy between the verbal description and the code example.)

You can use isin to check if an item appears in a different column, like so:

In [5]: df.candidate.isin(df.interviewer)
Out[5]: 
0    False
1     True
2    False
3    False
4     True
Name: candidate, dtype: bool

Consequently, you can do something like

df[~df.candidate.isin(df.interviewer)]

Note that this matches your original code, not your specification of subsequent rows. If you want to drop only if the rows are subsequent, I'd go with itertools, something like:

In [18]: bads = [i for ((i, cn), (j, iv)) in itertools.product(enumerate(df.candidate), enumerate(df.interviewer)) if j >=i and cn == iv]

In [19]: df[~df.index.isin(bads)]
Out[19]: 
  candidate interviewer  reports  year
0       Bob       Jason        4  2012
2     Ahmed    Jermaine       31  2013
3      Karl        Jake        2  2014
4     Molly         Amy        3  2014

Also, if you want to drop the subsequent rows, simply change things to

In [18]: bads = [j for ((i, cn), (j, iv)) in itertools.product(enumerate(df.candidate), enumerate(df.interviewer)) if j >=i and cn == iv]

Upvotes: 1

Related Questions