Reputation: 1644
In simple terms, I'm looking for the quickest way to search a string for any of a set of words using regular expressions, without writing a for loop. That is, is there a way to do something like this:
text = 'asdfadfgargqerno_TP53_dfgnafoqwefe_ATM_cvafukyhfjakhdfialb'
genes = set(['TP53','ATM','BRCA2'])
mutations = 0
if re.search( genes, text):
    mutations += 1
print mutations
>>>1
The reason is that I'm searching a complicated data structure and don't want to nest too many loops. Here is the problem code in more detail:
import re

genes = set(['TP53','ATM','BRCA2'])
single_gene = 'ATM'
mutations = 0
data_dict = {
    'sample1': set(['AAA','BBB','TP53']),
    'sample2': set(['AAA','ATM','TP53']),
    'sample3': set(['AAA','CCC','XXX']),
    'sample4': set(['AAA','ZZZ','BRCA2']),
}
for sample in data_dict:
    for gene in data_dict[sample]:
        if re.search( single_gene, gene):
            mutations += 1
            break
I can easily search for 'single_gene', but I want to search for every member of 'genes'. If I add another for loop to iterate through 'genes', the code becomes more complicated, because I have to add another 'break' and a boolean to control when the outer break occurs. Functionally it works, but it is very clunky, and there must be a more elegant way to do it. Here is my clunky extra loop over the set (currently my only solution):
for sample in data_dict:
    for gene in data_dict[sample]:
        MUT = False
        for mut in genes:
            if re.search( mut, gene):
                mutations += 1
                MUT = True
                break
        if MUT == True:
            break
IMPORTANTLY: I only want to add 0 or 1 to 'mutations' per sample, depending on whether ANY gene from 'genes' occurs in that sample's set. For example, 'sample2' adds 1 to mutations and 'sample3' adds 0. Let me know if anything needs further clarifying. Thanks in advance!
Upvotes: 0
Views: 87
Reputation: 43497
If your target strings are fixed text (that is, not regular expressions), don't use re. It is far more efficient to:
for gene in genes:
    if gene in text:
        print('True')
There are variations on that theme, such as:
if [gene for gene in genes if gene in text]:
    ...
which is pretty confusing to read, contains a list comprehension, and counts on the fact that an empty list [] is considered false in Python.
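A further variation, sketched here as an addition to the two forms above, is the built-in any() with a generator expression. It stops at the first gene it finds and returns a plain boolean, so it does not rely on the truthiness of a list. The sample values are the ones from the question; the name found is only illustrative.

```python
# Sample values taken from the question.
text = 'asdfadfgargqerno_TP53_dfgnafoqwefe_ATM_cvafukyhfjakhdfialb'
genes = set(['TP53', 'ATM', 'BRCA2'])

# any() short-circuits on the first gene that occurs in text and
# returns True or False; no intermediate list is built.
found = any(gene in text for gene in genes)

mutations = 0
if found:
    mutations += 1
```

With these values, TP53 and ATM both occur in text, so found is True and mutations ends up as 1.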
Updated because the question changed:
You are still doing it the hard way. Consider instead:
def find_any_gene(genes, text):
    """Returns True if any of the subsequences in genes
    is found within text.
    """
    for gene in genes:
        if gene in text:
            return True
    return False

mutations = 0
for sample in data_dict:
    # entry is one string from the sample's set
    for entry in data_dict[sample]:
        if find_any_gene(genes, entry):
            mutations += 1
            break  # count each sample at most once
This needs less code to short-circuit the search, is more readable, and leaves find_any_gene() available to be called by other code.
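To make the "0 or 1 per sample" requirement from the question explicit, here is a minimal sketch that reuses the same idea and counts samples rather than entries. It assumes data_dict maps sample names to sets of strings, as in the question; count_mutated_samples is a hypothetical helper name, not something from the original posts.

```python
def find_any_gene(genes, text):
    """Return True if any string in genes is a substring of text."""
    return any(gene in text for gene in genes)


def count_mutated_samples(genes, data_dict):
    """Count samples in which at least one entry contains a gene.

    Hypothetical helper: each sample contributes at most 1.
    """
    return sum(
        1
        for entries in data_dict.values()
        if any(find_any_gene(genes, entry) for entry in entries)
    )


# Data from the question.
genes = set(['TP53', 'ATM', 'BRCA2'])
data_dict = {
    'sample1': set(['AAA', 'BBB', 'TP53']),
    'sample2': set(['AAA', 'ATM', 'TP53']),
    'sample3': set(['AAA', 'CCC', 'XXX']),
    'sample4': set(['AAA', 'ZZZ', 'BRCA2']),
}

mutations = count_mutated_samples(genes, data_dict)
```

For the question's data this gives 3: sample1, sample2 and sample4 each count once, and sample3 counts zero. Both any() calls short-circuit, so no explicit break or flag is needed.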
Upvotes: 1
Reputation: 1163
Does this work? I used some examples from the comments.
Let me know if I am close?!
genes = set(['TP53','ATM','BRCA2', 'aaC', 'CDH'])
mutations = 0
data_dict = {
    "sample1":set(['AAA','BBB','TP53']),
    "sample2":set(['AAA','ATM','TP53']),
    "sample3":set(['AAA','CCC','XXX']),
    "sample4":set(['123CDH47aaCDHzz','ZZZ','BRCA2'])
}
for sample in data_dict:
    for gene in data_dict[sample]:
        if [ mut for mut in genes if mut in gene ]:
            print "Found mutation: "+str(gene),
            print "in sample: "+str(data_dict[sample])
            mutations += 1
print mutations
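Note that the loop above adds one for every matching entry, so a sample with two matching entries, such as sample2, adds two. Below is a sketch that adds at most one per sample. It uses a single compiled alternation pattern, in case a regular expression is wanted as in the question. Building the pattern with re.escape and '|'.join is my assumption about how to do that, and the names pattern and mutated_samples are illustrative only.

```python
import re

# Data from the answer above.
genes = set(['TP53', 'ATM', 'BRCA2', 'aaC', 'CDH'])
data_dict = {
    'sample1': set(['AAA', 'BBB', 'TP53']),
    'sample2': set(['AAA', 'ATM', 'TP53']),
    'sample3': set(['AAA', 'CCC', 'XXX']),
    'sample4': set(['123CDH47aaCDHzz', 'ZZZ', 'BRCA2']),
}

# One alternation such as 'ATM|BRCA2|...'. re.escape keeps every gene
# name literal; sorting only makes the pattern text deterministic.
pattern = re.compile('|'.join(re.escape(g) for g in sorted(genes)))

# A sample is listed once if any of its entries matches the pattern.
mutated_samples = sorted(
    sample
    for sample, entries in data_dict.items()
    if any(pattern.search(entry) for entry in entries)
)
mutations = len(mutated_samples)
```

Here mutated_samples is ['sample1', 'sample2', 'sample4'] and mutations is 3. For fixed strings the plain in test is usually simpler. The compiled pattern mainly helps when the targets really are patterns.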
Upvotes: 0