Ben

Reputation: 998

Exclude all rows that differ in only one column

I have a large(ish) set of experimental data that contains pairs of values. Each pair is associated with a particular barcode. Ideally, each pair should have a unique barcode. Unfortunately, it turns out that I screwed something up during the experiment. Now several pairs share a single barcode. I need to exclude these pairs/barcodes from my analysis.

My data looks kind of like this. The pairs are in columns 'A' and 'B'; I just included 'X' to represent some arbitrary associated data:

import numpy as np
import pandas as pd

df = pd.DataFrame({'Barcode' : ['AABBCC', 'AABBCC', 'BABACC', 'AABBCC', 'DABBAC', 'ABDABD', 'DABBAC'],
                   'A' : ['v', 'v', 'x', 'y', 'z', 'h', 'z'],
                   'B' : ['h', 'h', 'j', 'k', 'l', 'v', 'l'],
                   'X' : np.random.randint(10, size = 7)})
df = df[['Barcode', 'A', 'B', 'X']]
df
df

    Barcode A   B   X
0   AABBCC  v   h   8
1   AABBCC  v   h   7
2   BABACC  x   j   2
3   AABBCC  y   k   3
4   DABBAC  z   l   8
5   ABDABD  h   v   0
6   DABBAC  z   l   4

I want to get rid of the rows described by barcode 'AABBCC', since this barcode is associated with two different pairs (rows 0 and 1 are both the same pair -- which is fine -- but row 3 is a different pair).

df.loc[df.Barcode != 'AABBCC']

    Barcode A   B   X
2   BABACC  x   j   2
4   DABBAC  z   l   8
5   ABDABD  h   v   0
6   DABBAC  z   l   4

My solution thus far:

def duplicates(bar):
    # flag a barcode as a 'collision' if it is associated with more than
    # one distinct A value or more than one distinct B value
    if len(df.loc[df.Barcode == bar].A.unique()) > 1 or len(df.loc[df.Barcode == bar].B.unique()) > 1:
        return 'collision'
    else:
        return 'single'

df['Barcode_collision'] = df.apply(lambda row: duplicates(row['Barcode']), axis = 1)
df.loc[df.Barcode_collision == 'single']

    Barcode A   B   X   Barcode_collision
2   BABACC  x   j   2   single
4   DABBAC  z   l   8   single
5   ABDABD  h   v   0   single
6   DABBAC  z   l   4   single

Unfortunately, this is very slow with a large dataframe (~500,000 rows) on my delicate computer -- I suspect because duplicates() re-filters the entire dataframe for every single row. I'm sure there must be a better/faster way. Maybe using the groupby function?

df.groupby(['Barcode', 'A', 'B']).count()

                X
Barcode A   B   
AABBCC  v   h   2
        y   k   1
ABDABD  h   v   1
BABACC  x   j   1
DABBAC  z   l   2

Then filtering out barcodes that have more than one value in the second or third index levels? But my brain and my googling skills can't seem to get me much further than the rough sketch below...
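This is as far as I've gotten (a sketch only -- I haven't verified it's correct, and I have no idea if it's fast):

counts = df.groupby(['Barcode', 'A', 'B']).size()
# how many distinct (A, B) combinations each barcode is associated with
pairs_per_barcode = counts.groupby(level='Barcode').size()
bad_barcodes = pairs_per_barcode[pairs_per_barcode > 1].index
df[~df.Barcode.isin(bad_barcodes)]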

Upvotes: 1

Views: 92

Answers (1)

jezrael

Reputation: 862751

You can use filter to keep only the barcodes whose group contains a single unique A value and a single unique B value (note the condition needs and, not or -- a barcode whose A values all match but whose B values differ is still a collision):

print(df.groupby('Barcode').filter(lambda x: ((x.A.nunique() == 1) and (x.B.nunique() == 1))))

  Barcode  A  B  X Barcode_collision
2  BABACC  x  j  2            single
4  DABBAC  z  l  8            single
5  ABDABD  h  v  0            single
6  DABBAC  z  l  4            single

Another solution with transform and boolean indexing:

g = df.groupby('Barcode')
A = g.A.transform('nunique')
B = g.B.transform('nunique')

print (df[(A == 1) & (B == 1)])
  Barcode  A  B  X Barcode_collision
2  BABACC  x  j  2            single
4  DABBAC  z  l  8            single
5  ABDABD  h  v  0            single
6  DABBAC  z  l  4            single
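For a very large dataframe, another option in the same spirit that avoids grouping altogether is to deduplicate the (Barcode, A, B) triples and keep the barcodes that survive only once (a sketch, not benchmarked):

pairs = df[['Barcode', 'A', 'B']].drop_duplicates()
# barcodes that map to exactly one distinct (A, B) pair
good = pairs.Barcode.value_counts()
good = good[good == 1].index

print (df[df.Barcode.isin(good)])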

Upvotes: 1
