Reputation: 64074
I have a text in which I want to find every word that appears in a given set and wrap each match in a tag. This is the code:
import re

mytext = "xxxxx repA1 yyyy REPA1 zzz."
geneset = {'leuB', 'repA1'}  # The actual set has ~1 million entries
result = mytext
for gene in geneset:
    regexp = re.compile(gene, flags=re.IGNORECASE)
    result = re.sub(regexp, r'<GENE>\g<0></GENE>', mytext)
print(result)
The expected output is:
xxxxx <GENE>repA1</GENE> yyyy <GENE>REPA1</GENE> zzz.
Why does the code above fail to produce this output?
Upvotes: 0
Views: 496
Reputation: 7009
You should change mytext in re.sub to result. That way you update the variable result each time you loop over geneset, instead of starting with the original (and not-updated) string mytext on every iteration.
for gene in geneset:
    regexp = re.compile(r"(?i)({})".format(gene))
    result = re.sub(regexp, r'<GENE>\g<1></GENE>', result)
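For reference, here is a self-contained sketch of the corrected loop. The function name tag_genes is hypothetical, and the re.escape call is an addition that is not part of the answer above: it assumes gene names should be matched as literal text even if they contain regex metacharacters.

```python
import re

def tag_genes(text, genes):
    """Wrap every case-insensitive occurrence of each gene name in <GENE> tags."""
    result = text
    for gene in genes:
        # re.escape is an assumption: treat the gene name as literal text.
        pattern = re.compile(re.escape(gene), flags=re.IGNORECASE)
        # Substitute into result, not the original text, so earlier tags are kept.
        result = pattern.sub(r'<GENE>\g<0></GENE>', result)
    return result

print(tag_genes("xxxxx repA1 yyyy REPA1 zzz.", ['leuB', 'repA1']))
# xxxxx <GENE>repA1</GENE> yyyy <GENE>REPA1</GENE> zzz.
```

Because each pass feeds the previous pass's output back in, the result no longer depends on the order in which the set is iterated.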
Upvotes: 1
Reputation: 749
In your code, you are calling re.sub on the original text, which does not change between iterations. If you pass the result variable instead, as in result = re.sub(regexp, r'<GENE>\g<0></GENE>', result), the output will be correct.
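A small demonstration of why the original loop loses matches: each pass substitutes into the untouched mytext and overwrites result, so only the last gene processed has any effect. A list is used here instead of a set purely to make the iteration order predictable; that ordering is an assumption for illustration, since a set gives no such guarantee.

```python
import re

mytext = "xxxxx repA1 yyyy REPA1 zzz."
genes = ['repA1', 'leuB']  # fixed order for the demo; a set has no guaranteed order

result = mytext
for gene in genes:
    regexp = re.compile(gene, flags=re.IGNORECASE)
    # Bug reproduced on purpose: the input is always the original mytext.
    result = re.sub(regexp, r'<GENE>\g<0></GENE>', mytext)

# 'leuB' ran last and matches nothing, so the repA1 tags were discarded.
print(result == mytext)  # True
```

With a real set, the original code may appear to work on some runs and fail on others, depending on which gene happens to be processed last.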
Upvotes: 2