Get the first match with re.findall without access to any Python code

Question

I am using Python 3.7 to recognize patterns in PDF documents by extracting elements with regular expressions. My problem is that I need only the first match of the regular expression, but when I apply my regex to the text below it finds both occurrences.

"FECHA DE EMISION ","26/03/2021 "
"Comuna: ","Valparaiso "
"FECHA DE EMISION ","26/03/2021 "

The regex I am using is:

(FECHA\sDE\sEMISION.*)

The result I need is just the first match of the regex to get:

"FECHA DE EMISION ","26/03/2021 "

It is important to note that the two matches have identical content.

I also tried to use \g<1> to refer to capture group 1, but it didn't work for me. I think it has to do with the fact that I am not using a lazy quantifier instead of a greedy one.

It is important to note that I cannot solve this with Python code or any of its other features. I specifically use re.findall and can't add any additional logic, which is why I need a pattern that by itself returns only the first match.

Any idea how to solve it?

Wiktor Stribiżew · Accepted Answer

If you could use a PCRE/Onigmo/Boost regex engine or the PyPI regex module, you could get the match value directly using

\A[\s\S]*?\K"FECHA\sDE\sEMISION.*

where \K makes the regex engine "forget" the text matched so far. See this regex demo.

Since you are bound to use a pattern for re.findall, you can use

\A[\s\S]*?("FECHA\sDE\sEMISION.*)

See the regex demo.

Details:

\A - unambiguous start of string
[\s\S]*? - any zero or more chars, as few as possible
("FECHA\sDE\sEMISION.*) - Capturing group 1: "FECHA DE EMISION with any whitespace between the words and then the rest of the line.
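
To see why this yields a single result, here is a minimal sketch that runs both patterns through re.findall. The text variable is an assumption reconstructed from the sample lines in the question; the real input would come from the PDF extraction.

```python
import re

# Sample text reconstructed from the question (assumed; the real input comes from a PDF).
text = (
    '"FECHA DE EMISION ","26/03/2021 "\n'
    '"Comuna: ","Valparaiso "\n'
    '"FECHA DE EMISION ","26/03/2021 "\n'
)

# Pattern from the question: it matches on every line that contains the label.
all_matches = re.findall(r'(FECHA\sDE\sEMISION.*)', text)

# Anchored pattern: \A only matches at the very start of the string, so once
# the first match has been consumed, no later attempt can succeed.
first_only = re.findall(r'\A[\s\S]*?("FECHA\sDE\sEMISION.*)', text)

print(len(all_matches))  # number of matches found by the original pattern
print(first_only)        # list holding only the first occurrence
```

Because the pattern contains exactly one capturing group, re.findall returns the group's text rather than the whole match, so the lazily skipped prefix matched by [\s\S]*? is not part of the result.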
