Reputation: 998
I have a large(ish) set of experimental data that contains pairs of values. Each pair is associated with a particular barcode. Ideally, each pair should have a unique barcode. Unfortunately, it turns out that I screwed something up during the experiment. Now several pairs share a single barcode. I need to exclude these pairs/barcodes from my analysis.
My data looks kind of like this:
The pairs are in columns 'A' and 'B' -- I just included 'X' to represent some arbitrary associated data:
import numpy as np
import pandas as pd

df = pd.DataFrame({'Barcode': ['AABBCC', 'AABBCC', 'BABACC', 'AABBCC', 'DABBAC', 'ABDABD', 'DABBAC'],
                   'A': ['v', 'v', 'x', 'y', 'z', 'h', 'z'],
                   'B': ['h', 'h', 'j', 'k', 'l', 'v', 'l'],
                   'X': np.random.randint(10, size=7)})
df = df[['Barcode', 'A', 'B', 'X']]
df
Barcode A B X
0 AABBCC v h 8
1 AABBCC v h 7
2 BABACC x j 2
3 AABBCC y k 3
4 DABBAC z l 8
5 ABDABD h v 0
6 DABBAC z l 4
I want to get rid of the rows described by barcode 'AABBCC', since this barcode is associated with two different pairs (rows 0 and 1 are the same pair -- which is fine -- but row 3 is a different pair).
df.loc[df.Barcode != 'AABBCC']
Barcode A B X
2 BABACC x j 6
4 DABBAC z l 0
5 ABDABD h v 7
6 DABBAC z l 5
My solution thus far:
def duplicates(bar):
    # A barcode is a 'collision' if it appears with more than one distinct A or B value
    if len(df.loc[df.Barcode == bar].A.unique()) > 1 or len(df.loc[df.Barcode == bar].B.unique()) > 1:
        return 'collision'
    else:
        return 'single'

df['Barcode_collision'] = df.apply(lambda row: duplicates(row['Barcode']), axis=1)
df.loc[df.Barcode_collision == 'single']
Barcode A B X Barcode_collision
2 BABACC x j 6 single
4 DABBAC z l 0 single
5 ABDABD h v 7 single
6 DABBAC z l 5 single
Unfortunately, this is very slow on my delicate computer with a large DataFrame (~500,000 rows). I'm sure there must be a better/faster way. Maybe using the groupby function?
df.groupby(['Barcode', 'A', 'B']).count()
             X
Barcode A B
AABBCC  v h  2
        y k  1
ABDABD  h v  1
BABACC  x j  1
DABBAC  z l  2
Then filter out the barcodes that have more than one value in the second or third index level? But my brain and my googling skills can't seem to get me further than this...
Upvotes: 1
Views: 92
Reputation: 862751
You can use filter:
print(df.groupby('Barcode').filter(lambda x: (x.A.nunique() == 1) and (x.B.nunique() == 1)))
Barcode A B X Barcode_collision
2 BABACC x j 4 single
4 DABBAC z l 9 single
5 ABDABD h v 3 single
6 DABBAC z l 9 single
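filter calls the lambda once per Barcode group and keeps the whole group only when it returns True. If you want to test the (A, B) pair as a whole instead of each column separately, a rough equivalent (a sketch, not timed on your data) is:
# keep only barcodes whose rows collapse to a single distinct (A, B) pair
print(df.groupby('Barcode').filter(lambda x: len(x[['A', 'B']].drop_duplicates()) == 1))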
Another solution with transform and boolean indexing:
g = df.groupby('Barcode')
A = g.A.transform('nunique')
B = g.B.transform('nunique')
print(df[(A == 1) & (B == 1)])
Barcode A B X Barcode_collision
2 BABACC x j 2 single
4 DABBAC z l 6 single
5 ABDABD h v 1 single
6 DABBAC z l 3 single
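Since speed is the main concern, it may be worth timing both approaches on the real ~500,000-row data; transform('nunique') avoids calling a Python lambda for every group, so it is often the faster option when there are many barcodes. A rough comparison sketch (using IPython's %timeit magic):
# in IPython/Jupyter -- compare both approaches on the full DataFrame
%timeit df.groupby('Barcode').filter(lambda x: (x.A.nunique() == 1) and (x.B.nunique() == 1))

g = df.groupby('Barcode')
%timeit df[(g.A.transform('nunique') == 1) & (g.B.transform('nunique') == 1)]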
Upvotes: 1