NegalWoods

Reputation: 23

Get the first match with re.findall without access to any Python code

I am using Python 3.7 to recognize patterns in PDF documents by extracting elements with regular expressions. My problem is that I need only the first match of the regular expression, but when I apply my regex to the following text it finds both occurrences.

"FECHA DE EMISION ","26/03/2021 "
"Comuna: ","Valparaiso "
"FECHA DE EMISION ","26/03/2021 "

The regex I am using is:

(FECHA\sDE\sEMISION.*)

The result I need is just the first match of the regex to get:

"FECHA DE EMISION ","26/03/2021 "

Note that the two matches have identical content.
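For reference, here is a minimal reproduction of the behavior. The SAMPLE string is an assumption: it is rebuilt from the three lines quoted above, not taken from the real PDF extraction.

```python
import re

# Assumed input: the three extracted lines quoted in the question.
SAMPLE = (
    '"FECHA DE EMISION ","26/03/2021 "\n'
    '"Comuna: ","Valparaiso "\n'
    '"FECHA DE EMISION ","26/03/2021 "\n'
)

# re.findall returns every non-overlapping match, so both
# "FECHA DE EMISION" lines are reported.
matches = re.findall(r'(FECHA\sDE\sEMISION.*)', SAMPLE)
print(len(matches))
for m in matches:
    print(m)
```

Both returned strings are identical, which is why picking "the first one" has to happen inside the pattern itself.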

I also tried referring to capture group 1 with \g<1>, but it didn't work for me. I suspect this is because my pattern is greedy rather than lazy.

It is important to note that I cannot solve this with Python code or any other Python functionality. I am limited to re.findall and cannot add anything else, so I need an expression that by itself returns only the first match.

Any idea how to solve it?

Upvotes: 2

Views: 320

Answers (1)

Wiktor Stribiżew

Reputation: 626699

If you could use a PCRE/Onigmo/Boost regex engine or the PyPI regex module, you could get the match value directly using

\A[\s\S]*?\K"FECHA\sDE\sEMISION.*

where \K makes the regex engine "forget" the text matched so far. See this regex demo.

Since you are bound to use a pattern for re.findall, you can use

\A[\s\S]*?("FECHA\sDE\sEMISION.*)

See the regex demo.

Details:

  • \A - unambiguous start of string (it can only match at position 0, so the whole pattern can match at most once)
  • [\s\S]*? - any zero or more chars, as few as possible
  • ("FECHA\sDE\sEMISION.*) - Capturing group 1: "FECHA DE EMISION with any whitespace between the words and then the rest of the line.
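As a quick check of the pattern above, here is a minimal sketch with re.findall. The SAMPLE text is an assumption rebuilt from the lines in the question, and first_match is a hypothetical helper name used only for illustration.

```python
import re

# Assumed input: the three extracted lines quoted in the question.
SAMPLE = (
    '"FECHA DE EMISION ","26/03/2021 "\n'
    '"Comuna: ","Valparaiso "\n'
    '"FECHA DE EMISION ","26/03/2021 "\n'
)

PATTERN = r'\A[\s\S]*?("FECHA\sDE\sEMISION.*)'


def first_match(text):
    """Hypothetical helper: run re.findall with the anchored pattern."""
    # With exactly one capturing group, re.findall returns a list of
    # the group 1 values. \A restricts the pattern to a single match.
    return re.findall(PATTERN, text)


print(first_match(SAMPLE))
# The same pattern still yields a single result when other lines
# precede the first occurrence.
print(first_match('"Comuna: ","Valparaiso "\n' + SAMPLE))
```

Since . does not match a line break by default, group 1 stops at the end of the first matching line, and the lazy [\s\S]*? makes sure that line is the earliest one in the text.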

Upvotes: 1
