Reputation: 23
I am trying, in Python 3.7, to recognize patterns in PDF documents by extracting elements with regular expressions. The problem is that I need only the first match of the regular expression. However, when I use my regex, it finds both occurrences in this text:
"FECHA DE EMISION ","26/03/2021 "
"Comuna: ","Valparaiso "
"FECHA DE EMISION ","26/03/2021 "
The regex I am using is:
(FECHA\sDE\sEMISION.*)
The result I need is just the first match of the regex to get:
"FECHA DE EMISION ","26/03/2021 "
Note that the two matches have identical content.
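For reference, the behaviour can be reproduced with a minimal sketch like the one below. The `sample` string is an assumption: it simply joins the three lines shown above with newlines, which may differ from the text actually extracted from the PDF.

```python
import re

# Assumed sample: the three extracted lines shown above, one per line.
sample = (
    '"FECHA DE EMISION ","26/03/2021 "\n'
    '"Comuna: ","Valparaiso "\n'
    '"FECHA DE EMISION ","26/03/2021 "\n'
)

# The pattern from the question. "." does not match a newline by default,
# so each match stops at the end of its own line.
matches = re.findall(r'(FECHA\sDE\sEMISION.*)', sample)

# re.findall returns every non-overlapping match, so both lines are reported.
print(len(matches))
```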
I also tried to reference capture group 1 with \g<1>, but it didn't work for me. I think it has to do with my not using a lazy (non-greedy) quantifier.
Note that I cannot solve this with additional Python code or other functionality. I am restricted to re.findall and cannot add anything else, which is why I need an expression that returns only the first match.
Any idea how to solve it?
Upvotes: 2
Views: 320
Reputation: 626699
If you could use a PCRE/Onigmo/Boost regex engine or the PyPI regex module, you could get the match value directly using
\A[\s\S]*?\K"FECHA\sDE\sEMISION.*
where \K makes the regex engine "forget" the text matched so far. See this regex demo.
Since you are bound to use a pattern for re.findall, you can use
\A[\s\S]*?("FECHA\sDE\sEMISION.*)
See the regex demo.
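A minimal sketch of how this pattern behaves with re.findall, assuming the input is the three lines from the question joined with newlines (the `sample` and `PATTERN` names are illustrative only):

```python
import re

# Assumed sample: the three lines from the question, one per line.
sample = (
    '"FECHA DE EMISION ","26/03/2021 "\n'
    '"Comuna: ","Valparaiso "\n'
    '"FECHA DE EMISION ","26/03/2021 "\n'
)

# \A anchors at the very start of the string, so the pattern can match
# only once; the lazy [\s\S]*? skips as little text as possible before
# the first occurrence, which is captured in group 1.
PATTERN = r'\A[\s\S]*?("FECHA\sDE\sEMISION.*)'

# With a single capturing group, re.findall returns the group 1 values.
first_only = re.findall(PATTERN, sample)
print(first_only)

# The same holds when other lines precede the first occurrence.
shifted = re.findall(PATTERN, '"Comuna: ","Valparaiso "\n' + sample)
print(shifted)
```

Because \A can succeed only at position 0, the result list holds at most one item, so re.findall effectively returns just the first occurrence.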
Details:
\A - unambiguous start of string
[\s\S]*? - any zero or more chars, as few as possible
("FECHA\sDE\sEMISION.*) - Capturing group 1: "FECHA DE EMISION with any whitespace between the words and then the rest of the line.
Upvotes: 1