hengyue li

Reputation: 468

Regular expression does not find all matches

The following code can be run directly. I want it to return a list of strings, l = ['1', '2']. Instead, I get a single string: everything between the very first "begin" and the very last "end". I understand that this is one valid match, but I cannot work out why the two figures are not matched separately.

import re

text = r'''

\begin{figure}
1
\end{figure}

aaa

\begin{figure}
2
\end{figure}

'''

pattern = r"\\begin{figure}([\s\S^f]*)\\end{figure}"
r = re.findall(pattern, text)


print(r)

Upvotes: 1

Views: 93

Answers (2)

Tim Biegeleisen

Reputation: 522626

Your pattern had multiple problems. Here is a working version:

import re

text = r'''

\begin{figure}
1
\end{figure}

aaa

\begin{figure}
2
\end{figure}

'''

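# Tempered dot: consume characters only while \end{figure} does not start here,
# then capture the first run of digits inside the figure.
pattern = r"\\begin\{figure\}(?:(?!\\end\{figure\}).)*?(\d+).*?\\end\{figure\}"
# re.DOTALL lets "." match newlines as well.
nums = re.findall(pattern, text, flags=re.DOTALL)
print(nums)  # ['1', '2']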

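Notes: The pattern uses a tempered dot, (?:(?!\\end\{figure\}).)*?, to match the content after the leading \begin{figure} marker without crossing over the closing \end{figure} marker. It also uses dot-all mode (re.DOTALL), so that . can match across newlines. In addition, your pattern contained regex metacharacters such as {, which are best escaped with a backslash. Python's re happens to treat {figure} literally because it is not a valid quantifier, but escaping makes the intent explicit.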
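
To illustrate what the tempered dot buys you, here is a small sketch on a made-up input (my assumption, not the text from the question) in which the first figure contains no digits. The name untempered is just a hypothetical comparison pattern that uses a plain lazy .*? instead:

```python
import re

# Hypothetical input: the first figure contains no digits at all.
text = r'''
\begin{figure}
no number here
\end{figure}

\begin{figure}
2
\end{figure}
'''

tempered = r"\\begin\{figure\}(?:(?!\\end\{figure\}).)*?(\d+).*?\\end\{figure\}"
untempered = r"\\begin\{figure\}.*?(\d+).*?\\end\{figure\}"

m_tempered = re.search(tempered, text, flags=re.DOTALL)
m_untempered = re.search(untempered, text, flags=re.DOTALL)

# Count how many \begin{figure} markers each overall match spans.
spans_tempered = m_tempered.group(0).count(r"\begin{figure}")
spans_untempered = m_untempered.group(0).count(r"\begin{figure}")

print(m_tempered.group(1), spans_tempered)
print(m_untempered.group(1), spans_untempered)
```

Both patterns capture the same digits here, but the untempered match starts at the first \begin{figure} and runs across its \end{figure} into the second figure, while the tempered match stays inside a single figure.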

Upvotes: 0

Jotha

Reputation: 428

The * operator is greedy: it matches as many characters as possible, so the group runs all the way to the last occurrence of \end{figure}. To match only as many characters as needed, use the lazy form *? instead: pattern = r"\\begin{figure}([\s\S^f]*?)\\end{figure}".
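
A minimal side-by-side sketch of the greedy and lazy forms on the text from the question. I have simplified the character class to [\s\S] (it already matches every character, so the ^f adds nothing) and escaped the braces; the variable names are only for illustration:

```python
import re

text = r'''
\begin{figure}
1
\end{figure}

aaa

\begin{figure}
2
\end{figure}
'''

# Greedy: [\s\S]* runs to the LAST \end{figure}, so there is a single match.
greedy = re.findall(r"\\begin\{figure\}([\s\S]*)\\end\{figure\}", text)

# Lazy: [\s\S]*? stops at the FIRST \end{figure} after each \begin{figure}.
lazy = re.findall(r"\\begin\{figure\}([\s\S]*?)\\end\{figure\}", text)

# The captured groups still include the surrounding newlines, so strip them.
figures = [body.strip() for body in lazy]

print(len(greedy))
print(lazy)
print(figures)
```

Note that the lazy version returns the figure bodies including their newlines, so a strip() (or a \s* on each side of the group) is still needed to get exactly '1' and '2'.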

Upvotes: 1
