Regex Python - Find every keyword instance, extract keyword and following characters

Question

This is driving me insane.

I'm trying to find every instance of "DOI" or its mis-scanned equivalents in a series of documents. For each one, I then want to collect the matched term and up to 15 alphanumeric characters that follow it. I also need to find these even when they overlap with a previous match.

I've tried to adapt a solution I was given for a similar problem, but with no success:

Python regex find all overlapping matches?

Here is the example I'm using to test this.

String to search :

"abhgfigDOI567afkgD0Idhdhfhfhdbvbkab3343432q3DO1fbaguig7ggkgafgkgDOIDOID01OO1"

DOI variations:

DOI|DO1|D01|D0I|001|00I|0O1|0OI|O01|O0I|OO1|OOI

Expected results:

["DOI567afkgD0Idhdhf",
"D0Idhdhfhfhdbvbkab",
"DO1fbaguig7ggkgafg",
"DOIDOID01OO1",
"DOID01OO1",
"D01OO1",
"OO1"]

Any assistance would be most appreciated!

Thanks!

John Machin · Accepted Answer

Using the "DOI variations" DOI|DO1|D01|D0I|001|00I|0O1|0OI|O01|O0I|OO1|OOI literally in that form is not a good idea. Start from basics: the first character can be D, 0, or O; the second can be O or 0; the third can be I or 1. This leads immediately to the pattern "[D0O][O0][I1]", which is much more compact, less error prone, and capable of much faster execution (if you wanted to get into Cython or C).
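As a quick sanity check (a sketch, not part of the original answer), the three character classes can be expanded with itertools.product and compared against the list from the question. The names variations, generated, same_set, tag, and all_match are made up for this illustration.

```python
import re
from itertools import product

# The alternation exactly as listed in the question.
variations = "DOI|DO1|D01|D0I|001|00I|0O1|0OI|O01|O0I|OO1|OOI".split("|")

# Every 3-character string the class-based pattern can match:
# one character from each of "D0O", "O0", and "I1".
generated = {"".join(chars) for chars in product("D0O", "O0", "I1")}

# True if both spellings describe the same set of tags.
same_set = generated == set(variations)

# Each listed variation should also be matched in full by the compact pattern.
tag = re.compile(r"[D0O][O0][I1]")
all_match = all(tag.fullmatch(v) is not None for v in variations)

print(same_set, all_match, len(generated))
```

Both checks come out true for the 12 variations listed, so the compact pattern is a drop-in replacement for the alternation.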

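In this particular case you can then use re.finditer() to find the matching 3-character prefixes, and take it from there. That works because two matches of this pattern can never overlap each other: a tag must end in I or 1, and neither character can appear in its first or second position.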
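One way to "take it from there" is sketched below. It assumes "alphanumeric" means ASCII letters and digits; the names TAG, TAIL, and find_doi_candidates are hypothetical and not from the original answer. Each 3-character tag is located with finditer(), then up to 15 alphanumeric characters are read starting at the end of the tag, so the collected tail is free to run into the next tag.

```python
import re

TAG = re.compile(r"[D0O][O0][I1]")        # the 3-character lead tag
TAIL = re.compile(r"[A-Za-z0-9]{0,15}")   # assumption: ASCII alphanumerics only

def find_doi_candidates(text):
    """Return each tag plus up to 15 alphanumeric characters after it."""
    results = []
    for m in TAG.finditer(text):
        # Pattern.match(text, pos) anchors at pos; {0,15} may match nothing,
        # so this never returns None, even at the end of the text.
        tail = TAIL.match(text, m.end())
        results.append(m.group() + tail.group())
    return results

sample = "abhgfigDOI567afkgD0Idhdhfhfhdbvbkab3343432q3DO1fbaguig7ggkgafgkgDOIDOID01OO1"
for candidate in find_doi_candidates(sample):
    print(candidate)
```

On the test string from the question this yields seven candidates, the last being the string's trailing three characters, OO1.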

In a more general case, e.g. if the lead tag were DOD instead of DOI, re.finditer() alone is not enough, because it returns only non-overlapping matches:

Input text:     DODOD987654321
First match:    DODOD987654321
Second match:     DOD987654321 # Not found by re.finditer()

In the most general case (e.g. lead tag is DDD), you need to call re.search() in a loop, advancing the search-start position by only 1 place after each successful match.
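A minimal sketch of that loop follows; the function name find_overlapping is hypothetical. Note that the module-level re.search() takes no start position, so the sketch uses the search(text, pos) method of a compiled pattern instead.

```python
import re

def find_overlapping(pattern, text):
    """Collect every match of pattern in text, including overlapping ones."""
    regex = re.compile(pattern)
    results = []
    pos = 0
    while True:
        m = regex.search(text, pos)   # search starts at index pos
        if m is None:
            break
        results.append(m.group())
        pos = m.start() + 1           # advance by one place only, not past the match
    return results

# Lead tag DOD followed by up to 15 alphanumeric characters.
print(find_overlapping(r"DOD[A-Za-z0-9]{0,15}", "DODOD987654321"))
# ['DODOD987654321', 'DOD987654321']
```

Restarting from m.start() + 1 rather than m.end() is what lets the second DOD, which begins inside the first match, be found.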
