Reputation: 3332
I have a (non-raw) string in Python similar to the following:
Plenary Papers (1)
Peer-reviewed Papers (113)
PLENARY MANUSCRIPTS (1)
First Author Index
Harrer
Plenary Papers
One Some title
John W. Doe
2018 Physics SOmething Proceedings
Full Text: Download PDF - PER-Central Record
Show Abstract - Show Citation
PEER REVIEWED MANUSCRIPTS (113)
First Author Index
Doe · Doe2 · Doe3 · Jonathan
Peer-reviewed Papers
Two some title
Alex White, Paul Klee, and Jacson Pollock
2018 Physics Research Conference Proceedings, doi:10.1234/perc.2018.pr.White
Full Text: Download PDF - PER-Central Record
Show Abstract - Show Citation
Tree Some title
Suzanne Heck, Alex Someone, John I. Smith, and Andrew Bourgogne
2018 Physics Education Research Conference Proceedings, doi:10.2345/perc.2018.pr.Heck
Full Text: Download PDF - PER-Central Record
Show Abstract - Show Citation
..
I want to scrape the metadata of those three papers, i.e. the few lines after each title (e.g. "One Some title", "John W. Doe", and "2018 Physics Something Proceedings").
I thought of using two patterns to mark the beginning and end of each selection:
r"\n\n" and r"Show Abstract - Show Citation".
This (almost) works on https://regex101.com/ using this regular expression:
\n\n(.*?)Show Abstract - Show Citation
A minor issue there is that the match is too greedy for the first two papers.
However, the same expression does not work in Python:
import re
pattern = r"\n\n(.*?)Show Abstract - Show Citation"
re.findall(pattern, titles)  # titles is the text above
# output is []
pattern_only_one_line = r"\nShow Abstract - Show Citation"
re.findall(pattern_only_one_line, titles)
# output shows three lines
Could this be another problem with raw strings?
Upvotes: 1
Views: 104
Reputation: 80021
The re.DOTALL flag is missing. Without it, the dot (.) won't match newlines.
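As a minimal sketch of that difference, the snippet below runs the pattern from the question with and without the flag. The `sample` string is a hypothetical, shortened stand-in for the real text, and it assumes the real text has a blank line before each title.

```python
import re

# Hypothetical, shortened stand-in for the real text: one record
# preceded by a blank line.
sample = (
    "First Author Index\n"
    "\n"
    "One Some title\n"
    "John W. Doe\n"
    "Show Abstract - Show Citation\n"
)

pattern = r"\n\n(.*?)Show Abstract - Show Citation"

# Without re.DOTALL the dot stops at the first newline, so the lazy
# group can never reach the closing marker on a later line.
without_flag = re.findall(pattern, sample)

# With re.DOTALL the dot also matches newlines, so the group can
# span several lines.
with_flag = re.findall(pattern, sample, re.DOTALL)

print(without_flag)
print(with_flag)
```

This would also explain why the one-line pattern in the question matches: it never needs the dot to cross a line break.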
But we can do better (depending on what you need exactly of course): https://regex101.com/r/iN6pX6/199
import re
import pprint
titles = '''
[Omitted for brevity]
..
'''
pattern = r'''
(?P<title>[^\n]+)\n
(?P<subtitle>[^\n]+)\n
((?P<etc>[^\n].*?)\n\n|\n)
'''
# Make sure we don't have any extraneous whitespace but add the separator
titles = titles.strip() + '\n\n'
for match in re.finditer(pattern, titles, re.DOTALL | re.VERBOSE):
    title = match.group('title')
    subtitle = match.group('subtitle')
    etc = match.group('etc')
    print('## %r' % title)
    print('# %r' % subtitle)
    if etc:
        print(etc)
    print()
    # pprint.pprint(match.groupdict())
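Since the input above is omitted, here is a small self-contained run of the same pattern. The two-record `sample` is a hypothetical excerpt (DOIs and download lines left out), and it assumes records are separated by a blank line, as the pattern requires.

```python
import re

# Hypothetical two-record excerpt; the real input is the full listing.
sample = """
One Some title
John W. Doe
2018 Physics Something Proceedings
Show Abstract - Show Citation

Two some title
Alex White, Paul Klee, and Jacson Pollock
2018 Physics Research Conference Proceedings
Show Abstract - Show Citation
"""

# Same pattern as above: first line, second line, then everything up
# to the next blank line.
pattern = r'''
(?P<title>[^\n]+)\n
(?P<subtitle>[^\n]+)\n
((?P<etc>[^\n].*?)\n\n|\n)
'''

# Normalise the ends so the last record also ends with a blank line.
text = sample.strip() + '\n\n'

records = [
    (m.group('title'), m.group('subtitle'), m.group('etc'))
    for m in re.finditer(pattern, text, re.DOTALL | re.VERBOSE)
]

for title, subtitle, etc in records:
    print(title, '|', subtitle)
```

On the full text, header blocks such as the author index would presumably be matched as records too, so some filtering on the captured groups may still be needed.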
Upvotes: 1