Reputation: 47
Using the pandas library in Python, I have a line in my code that looks like this:
BadData = len(df[df.A1.str.contains('A|T|C|G')==False])
What I'm trying to do here is count the number of entries in the A1 column of the dataframe df that do not contain any combination of the letters A, T, C, and G.
These expressions should be counted as BadData:
But these expressions should not:
My question: how could I use regex characters to include entries like "Apple" or "Golfing" in BadData?
I could chain together conditions like so:
BadData = len(df[(df.A1.str.contains('A|T|C|G')==False) & (df.A1.str.contains('0|1|2|3')==True)])
But here I face a difficulty: do I have to define every character that violates the condition? This seems clumsy, and I am sure there is a more elegant way.
Upvotes: 1
Views: 66
Reputation: 51395
You can use:
df['A1'].str.contains('^[ACTG]+$')
Which makes sure that it both starts (the regex ^) and ends (the regex $) with a letter in ACTG, and only contains one or more of those characters.
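To see why the anchors matter, here is a minimal sketch using the standard `re` module directly (pandas' `str.contains` does a regex search in the same way). The sample list and the names `loose`, `strict`, `loose_hits` and `strict_hits` are just for illustration:

```python
import re

# Unanchored alternation: matches if ANY of the letters appears anywhere.
loose = re.compile(r"A|T|C|G")
# Anchored character class: matches only if EVERY character is A, C, T or G.
strict = re.compile(r"^[ACTG]+$")

# Illustrative sample values (assumed, not the asker's real data).
samples = ["Apple", "Golfing", "ATTC", "AxTCG", "foo"]

loose_hits = [s for s in samples if loose.search(s)]
strict_hits = [s for s in samples if strict.search(s)]

print(loose_hits)   # ['Apple', 'Golfing', 'ATTC', 'AxTCG']
print(strict_hits)  # ['ATTC']
```

The original pattern treats "Apple" and "Golfing" as good because each contains at least one of the four letters, while the anchored character class rejects them.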
To get the len, you can just count the False values by summing the negated mask:
bad_data = sum(~df['A1'].str.contains('^[ACTG]+$'))
Which is equivalent to:
bad_data = len(df[df.A1.str.contains('^[ACTG]+$')==False])
But IMO nicer to read.
For example:
>>> df
A1
0 Apple
1 Golfing
2 A
3 ATTC
4 ACGT
5 AxTCG
6 foo
7 %
8 ACT Golf GTC
9 ACT
>>> df['A1'].str.contains('^[ACTG]+$')
0 False
1 False
2 True
3 True
4 True
5 False
6 False
7 False
8 False
9 True
Name: A1, dtype: bool
>>> bad_data = sum(~df['A1'].str.contains('^[ACTG]+$'))
>>> bad_data
6
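If the column might contain missing values, something like the following may be more robust. This is only a sketch: it assumes pandas 1.1 or later (for `str.fullmatch`, which makes the `^` and `$` anchors unnecessary), the `None` row is a hypothetical addition to the example data, and counting missing entries as bad is an assumption you may want to change:

```python
import pandas as pd

# Same example data as above, plus one hypothetical missing value.
df = pd.DataFrame({
    "A1": ["Apple", "Golfing", "A", "ATTC", "ACGT", "AxTCG",
           "foo", "%", "ACT Golf GTC", "ACT", None]
})

# fullmatch requires the WHOLE string to match the pattern.
# na=False treats missing values as non-matching, so the mask stays
# boolean and ~ behaves as expected.
valid = df["A1"].str.fullmatch("[ACTG]+", na=False)

# Count the rows that are not pure A/C/T/G strings (missing rows included).
bad_data = int((~valid).sum())
print(bad_data)  # 7
```

Without `na=False`, a missing value shows up as NaN in the mask, and negating it with `~` can raise an error or give unexpected results depending on the dtype.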
Upvotes: 1