neversaint

Reputation: 64074

Using regex compile through loop in Python

I have a text in which I want to find every word that appears in a given set and wrap each match in a tag. This is my code:

import re

mytext = "xxxxx repA1 yyyy REPA1 zzz."
geneset = {'leuB', 'repA1'} # The actual length is ~1Million entries

result = mytext
for gene in geneset:
    regexp = re.compile(gene, flags=re.IGNORECASE)
    result = re.sub(regexp, r'<GENE>\g<0></GENE>', mytext)

print result

The expected output is:

xxxxx <GENE>repA1</GENE> yyyy <GENE>REPA1</GENE> zzz.

Why does the code above fail to produce this output?

Upvotes: 0

Views: 496

Answers (2)

Stefan van den Akker

Reputation: 7009

You should change mytext in re.sub to result. That way you update the variable result each time you loop over geneset, instead of starting with the original (and not-updated) string mytext on every iteration.

for gene in geneset:
    regexp = re.compile(r"(?i)({})".format(gene))
    result = re.sub(regexp, r'<GENE>\g<1></GENE>', result)
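For reference, here is a self-contained sketch of the corrected loop using the sample data from the question. The function name tag_genes is hypothetical, and the call to re.escape is an addition not present in the original code: it assumes gene names should be matched literally, even if they contain characters such as '.' or '+'.

```python
import re

def tag_genes(text, genes):
    """Wrap every case-insensitive occurrence of each gene name in <GENE> tags."""
    result = text
    for gene in genes:
        # re.escape is an assumption added here: it treats the gene name as
        # literal text rather than as regex syntax.
        pattern = re.compile(re.escape(gene), flags=re.IGNORECASE)
        # Substitute into result, not the original text, so that tags added
        # in earlier iterations are preserved.
        result = pattern.sub(r'<GENE>\g<0></GENE>', result)
    return result

mytext = "xxxxx repA1 yyyy REPA1 zzz."
geneset = {'leuB', 'repA1'}

tagged = tag_genes(mytext, geneset)
print(tagged)
```

Because each pass runs over already-tagged text, a gene name that is a substring of another name (or of the tag itself) could be tagged more than once; the sample data does not hit that case.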

Upvotes: 1

soloidx

Reputation: 749

In your code, every call to re.sub operates on the original text, mytext, which never changes inside the loop, so each iteration throws away the substitutions made by the previous one. If you pass the result variable instead, as in result = re.sub(regexp, r'<GENE>\g<0></GENE>', result), the output will be correct.
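The difference can be seen by running both variants side by side. The sketch below uses a list rather than a set, purely as an assumption to make the iteration order fixed for the demonstration; the variable names buggy and fixed are illustrative.

```python
import re

mytext = "xxxxx repA1 yyyy REPA1 zzz."
# A list instead of a set, so the iteration order is fixed for this demo.
genes = ['repA1', 'leuB']

buggy = mytext
for gene in genes:
    regexp = re.compile(gene, flags=re.IGNORECASE)
    # Bug: substitutes into mytext, so only the last gene's result survives.
    buggy = regexp.sub(r'<GENE>\g<0></GENE>', mytext)

fixed = mytext
for gene in genes:
    regexp = re.compile(gene, flags=re.IGNORECASE)
    # Fix: substitutes into the accumulated result.
    fixed = regexp.sub(r'<GENE>\g<0></GENE>', fixed)

print(buggy)
print(fixed)
```

Here the last gene processed is 'leuB', which does not occur in the text, so the buggy variant ends up equal to the untouched input. With a set, which gene comes last is not guaranteed, so the original code may appear to work on some runs and not on others.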

Upvotes: 2
