NineWasps

Reputation: 2253

Python: find a string between 2 strings in text

I have a text like this

s = """
...

(1) Literature

1. a.
2. b.
3. c.

...
"""

I want to extract the Literature section, but I am having trouble detecting it.

Here is what I use:

re.search("(1) Literature\n\n(.*).\n\n", s).group(1)

but the search returns None.

The desired output is:

(1) Literature

1. a.
2. b.
3. c.

What did I do wrong?

Upvotes: 0

Views: 102

Answers (4)

anubhava

Reputation: 784958

You may use this regex with a capture group:

r'\(1\)\s+Literature\s+((?:.+\n)+)'


Explanation:

  • \(1\): Match the literal text (1)
  • \s+: Match 1+ whitespace characters
  • Literature: Match the literal text Literature
  • \s+: Match 1+ whitespace characters
  • (: Start capture group #1
    • (?:.+\n)+: Match a line with 1+ characters followed by a newline. Repeat this 1 or more times so that it matches multiple such lines
  • ): End capture group #1
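
As a minimal sketch of how this pattern might be used from Python (the sample text mirrors the question, and the variable names are only illustrative):

```python
import re

# Sample text shaped like the question's; the "..." lines stand in for other content.
s = """
...

(1) Literature

1. a.
2. b.
3. c.

...
"""

pattern = re.compile(r'\(1\)\s+Literature\s+((?:.+\n)+)')
match = pattern.search(s)

# group(1) holds only the list lines; group(0) also includes the heading.
items = match.group(1) if match else None
section = match.group(0) if match else None
print(section)
```

If the heading should be part of the result, as in the desired output, group(0) is probably the value to use rather than group(1).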

Upvotes: 1

zr0gravity7

Reputation: 3194

A regex for capturing any section with that structure:

\(\d+\)\s+(\w+)\s+((?:\d+\.\s.+\n)+)

It will capture the title "Literature", then the choices in another group (for a total of 2 groups).

A repeated capture group only keeps its last match, so in order to get each of your "1. a." items separately you would have to match the second group from above again, with this pattern:

(\d+\.\s+.+)\n applied globally (for example with re.findall) to get all items.
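
The two-step approach could look roughly like this in Python. This is a sketch under the assumption that the text has the shape shown in the question; the names section_re and item_re are made up for the example.

```python
import re

s = """
...

(1) Literature

1. a.
2. b.
3. c.

...
"""

# Step 1: capture the section title and the whole block of numbered lines.
section_re = re.compile(r'\(\d+\)\s+(\w+)\s+((?:\d+\.\s.+\n)+)')
# Step 2: pull the individual items out of that block.
item_re = re.compile(r'\d+\.\s+.+')

m = section_re.search(s)
title = m.group(1) if m else None
items = item_re.findall(m.group(2)) if m else []
print(title, items)
```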

Upvotes: 0

The fourth bird

Reputation: 163207

You could match (1) Literature and 2 newlines, and then capture all lines that start with digits followed by a dot.

\(1\) Literature\n\n((?:\d+\..*(?:\n|$))+)

The pattern matches:

  • \(1\) Literature\n\n Match (1) Literature and 2 newlines
  • ( Capture group 1
    • (?: Non capture group
      • \d+\..*(?:\n|$) Match 1+ digits and a dot, then the rest of the line, followed by either a newline or the end of the string
    • )+ Close non capture group and repeat it 1 or more times to match all the lines
  • ) Close group 1
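
A short sketch of this pattern in Python, assuming text shaped like the question's. The second sample (s_no_trailing_newline, a made-up name) is only there to show why the (?:\n|$) alternative helps when the last item ends the string.

```python
import re

s = """
...

(1) Literature

1. a.
2. b.
3. c.

...
"""

pattern = re.compile(r'\(1\) Literature\n\n((?:\d+\..*(?:\n|$))+)')

m = pattern.search(s)
# Split the captured block into one string per numbered line.
lines = m.group(1).splitlines() if m else []

# Hypothetical input whose last item is not followed by a newline.
s_no_trailing_newline = "(1) Literature\n\n1. a.\n2. b."
m2 = pattern.search(s_no_trailing_newline)
last_block = m2.group(1) if m2 else None
print(lines, last_block)
```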



Another option is to capture all following lines that do not start with ( digits ) using a negative lookahead, and then trim the leading and trailing whitespaces.

\(1\) Literature((?:\n(?!\(\d+\)).*)*)
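
To illustrate the lookahead variant, here is a sketch that assumes the text contains a following "(2) ..." heading; the sample text and the heading "History" are invented for the example.

```python
import re

# Hypothetical text with a second numbered section after the first one.
text = """intro

(1) Literature

1. a.
2. b.
3. c.

(2) History

1. d.
"""

pattern = re.compile(r'\(1\) Literature((?:\n(?!\(\d+\)).*)*)')

m = pattern.search(text)
# The capture starts and ends with newlines, so strip the surrounding whitespace.
body = m.group(1).strip() if m else None
print(body)
```

Because the match only stops at a line that starts with ( digits ), it runs to the end of the string when no later heading exists.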


Upvotes: 2

Pubudu Sitinamaluwa

Reputation: 978

Parentheses have a special meaning in regex. They are used to group matches.

(1) - Capture 1 as the first capturing group.

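Since the unescaped parentheses in the pattern form a group instead of matching the literal parentheses in the string, the match is not successful. In addition, .* stops at the end of a line, because . does not match a newline by default.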
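
Both points can be checked with a few searches. This is only an illustrative sketch; the last pattern is one possible minimal repair (escaped parentheses, a lazy quantifier, and re.DOTALL), not necessarily the best one.

```python
import re

s = """
...

(1) Literature

1. a.
2. b.
3. c.

...
"""

# Unescaped: (1) is a group matching "1", so this looks for "1 Literature".
unescaped = re.search(r"(1) Literature", s)
# Escaped: \( and \) match the literal parentheses.
escaped = re.search(r"\(1\) Literature", s)

# Without re.DOTALL, .* cannot cross the newlines between the list items.
single_line = re.search(r"\(1\) Literature\n\n(.*)\n\n", s)
# With re.DOTALL and a lazy .*?, the capture runs up to the first blank line.
multi_line = re.search(r"\(1\) Literature\n\n(.*?)\n\n", s, re.DOTALL)

print(unescaped, escaped is not None, single_line)
print(multi_line.group(1))
```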


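Based on your regex, I assumed you wanted to capture the line containing the word Literature plus the lines below it. Here is a regex to do so; the first of its five repetitions consumes the rest of the heading line, and the remaining four match the blank line and the three items.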

\(1\) Literature(.*\n){5}


Note the escape characters (backslashes) on the parentheses around 1.

EDIT

Based on zr0gravity7's comment, I came up with this regex to capture the middle section of the string.

\(1\)\sLiterature\n+((.*\n){3})

This regex captures the following string in capturing group 1:

1. a.
2. b.
3. c.
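
A quick way to try this in Python, assuming the section always has exactly three item lines (the {3} is hard-coded in the pattern):

```python
import re

s = """
...

(1) Literature

1. a.
2. b.
3. c.

...
"""

m = re.search(r'\(1\)\sLiterature\n+((.*\n){3})', s)

# group(1) is the full three-line block.
block = m.group(1) if m else None
# group(2) is the inner repeated group, which keeps only its last repetition.
last_line = m.group(2) if m else None
print(block)
```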


Upvotes: 1
