Reputation: 2253
I have a text like this
s = """
...
(1) Literature
1. a.
2. b.
3. c.
...
"""
I want to extract the Literature section, but I am having trouble matching it.
I tried
re.search("(1) Literature\n\n(.*).\n\n", s).group(1)
but re.search returns None.
The desired output is
(1) Literature
1. a.
2. b.
3. c.
What did I do wrong?
Upvotes: 0
Views: 102
Reputation: 784958
You may use this regex with a capture group:
r'\(1\)\s+Literature\s+((?:.+\n)+)'
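As a quick check, the pattern can be run against a sample string. This is a minimal sketch: the sample text below is an assumption (the question elides the surrounding content with "...", and the blank lines are inferred from the "\n\n" in the asker's pattern).

```python
import re

# Assumed sample text; the "..." lines stand in for the elided content.
s = "...\n\n(1) Literature\n\n1. a.\n2. b.\n3. c.\n\n...\n"

pattern = re.compile(r'\(1\)\s+Literature\s+((?:.+\n)+)')
m = pattern.search(s)

# Group 1 holds only the numbered lines; group 0 also includes the heading.
section = m.group(1) if m else None
whole = m.group(0) if m else None

print(section)
```

Because each repeated line must contain at least one character, the capture stops at the first blank line after the list.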
Explanation:
\(1\)
: Match (1) text
\s+
: Match 1+ whitespaces
Literature
: Match the text Literature
\s+
: Match 1+ whitespaces
(
: Start capture group #1
(?:.+\n)+
: Match a line with 1+ characters followed by a newline. Repeat this 1 or more times to allow it to match multiple such lines
)
: End capture group #1
Upvotes: 1
Reputation: 3194
Regex for capturing the generic question with that structure:
\(\d+\)\s+(\w+)\s+((?:\d+\.\s.+\n)+)
It will capture the title "Literature", then the choices in another group (for a total of 2 groups).
A repeated capture group keeps only its last repetition, so in order to get each of your "1. a." items separately you would have to match the text of the second group from above again, with this pattern:
(\d+\.\s+.+)
then match globally to get all items.
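The two-step approach described above might look like this in Python. It is a sketch under stated assumptions: the sample string is made up to mirror the question, and the per-item pattern is written without an outer repetition so that re.findall returns one entry per line.

```python
import re

# Assumed sample text; the "..." lines stand in for the elided content.
s = "...\n\n(1) Literature\n\n1. a.\n2. b.\n3. c.\n\n...\n"

# Step 1: capture the title and the whole block of numbered lines.
block = re.search(r'\(\d+\)\s+(\w+)\s+((?:\d+\.\s.+\n)+)', s)
if block is None:
    raise ValueError("no numbered section found")
title, body = block.group(1), block.group(2)

# Step 2: match globally inside the block to get each item on its own.
items = re.findall(r'\d+\.\s+.+', body)

print(title)
print(items)
```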
Upvotes: 0
Reputation: 163207
You could match (1) Literature and 2 newlines, and then capture all lines that start with digits followed by a dot.
\(1\) Literature\n\n((?:\d+\..*(?:\n|$))+)
The pattern matches:
\(1\) Literature\n\n
Match (1) Literature and 2 newlines
(
Capture group 1
(?:
Non capture group
\d+\..*(?:\n|$)
Match 1+ digits and a dot, then the rest of the line, followed by either a newline or end of string
)+
Close non capture group and repeat it 1 or more times to match all the lines
)
Close group 1
Another option is to capture all following lines that do not start with (digits) using a negative lookahead, and then trim the leading and trailing whitespace.
\(1\) Literature((?:\n(?!\(\d+\)).*)*)
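Both patterns can be compared side by side on a sample string. This is only a sketch: the text below is an assumption, and the "(2) Other" section is a hypothetical addition so that the negative lookahead has something to stop at.

```python
import re

# Assumed sample text; "(2) Other" is hypothetical, not from the question.
s = "...\n\n(1) Literature\n\n1. a.\n2. b.\n3. c.\n\n(2) Other\n\n1. x.\n"

# Option 1: require two newlines, then capture lines starting with digits and a dot.
m1 = re.search(r'\(1\) Literature\n\n((?:\d+\..*(?:\n|$))+)', s)
numbered = m1.group(1) if m1 else None

# Option 2: capture every following line that does not start with "(digits)",
# then trim the surrounding whitespace.
m2 = re.search(r'\(1\) Literature((?:\n(?!\(\d+\)).*)*)', s)
trimmed = m2.group(1).strip() if m2 else None

print(numbered)
print(trimmed)
```

The second option does not depend on the exact number of blank lines after the heading, at the cost of needing the strip() call afterwards.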
Upvotes: 2
Reputation: 978
Parentheses have a special meaning in regex: they group parts of a pattern.
(1)
- Captures 1 as the first capturing group; it does not match literal parentheses.
Since the string has literal parentheses in it, the match is not successful. Also, .* stops at the end of the line, because . does not match a newline by default.
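The two points above can be checked with a short experiment. This is a sketch on an assumed sample string; the use of re.DOTALL at the end is just one possible way to let . cross line breaks, not something the question or this answer relies on.

```python
import re

# Assumed sample text; the "..." lines stand in for the elided content.
s = "...\n\n(1) Literature\n\n1. a.\n2. b.\n3. c.\n\n...\n"

# Unescaped: "(1)" is a group around the literal "1", so the pattern
# effectively looks for "1 Literature", which the text does not contain.
unescaped = re.search("(1) Literature", s)

# Escaped: "\(" and "\)" match the literal parentheses.
escaped = re.search(r"\(1\) Literature", s)

# By default "." does not match a newline, so ".*" stops at the line end.
first_line = re.search(r"\(1\) Literature\n\n(.*)", s).group(1)

# With re.DOTALL, "." also matches newlines; the lazy ".*?" stops at the
# first blank line.
all_lines = re.search(r"\(1\) Literature\n\n(.*?)\n\n", s, re.DOTALL).group(1)

print(unescaped, escaped.group(0), first_line, all_lines, sep="\n")
```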
Based on your regex, I assumed you wanted to capture the line with the word Literature and the 5 lines below it. Here is a regex to do so.
\(1\) Literature(.*\n){5}
Note the escape characters used on the parentheses around 1.
EDIT
Based on zr0gravity7's comment, I came up with this regex to capture the middle section of the string.
\(1\)\sLiterature\n+((.*\n){3})
This regex will capture the string below in capturing group 1.
1. a.
2. b.
3. c.
Upvotes: 1