aless80

Reputation: 3332

Python returns no matches on working regex

I have a string (not a raw string) in Python similar to the following:

Plenary Papers (1)
Peer-reviewed Papers (113)
PLENARY MANUSCRIPTS (1)
First Author Index

Harrer
Plenary Papers

One Some title
John W. Doe
2018 Physics Something Proceedings
Full Text: Download PDF - PER-Central Record
Show Abstract - Show Citation
PEER REVIEWED MANUSCRIPTS (113)
First Author Index

Doe · Doe2 · Doe3 · Jonathan
Peer-reviewed Papers

Two some title
Alex White, Paul Klee, and Jacson Pollock
2018 Physics Research Conference Proceedings, doi:10.1234/perc.2018.pr.White
Full Text: Download PDF - PER-Central Record
Show Abstract - Show Citation

Tree Some title
Suzanne Heck, Alex Someone, John I. Smith, and Andrew Bourgogne
2018 Physics Education Research Conference Proceedings, doi:10.2345/perc.2018.pr.Heck
Full Text: Download PDF - PER-Central Record
Show Abstract - Show Citation

..

I want to scrape the metadata of those three papers, i.e. the few lines starting at each title (e.g. "One Some title", "John W. Doe", and "2018 Physics Something Proceedings").

I thought of using two patterns to mark the beginning and the end of the selection:

r"\n\n" and r"Show Abstract - Show Citation".

This (almost) works on https://regex101.com/ using this regular expression:

\n\n(.*?)Show Abstract - Show Citation

A minor issue there is that the match is too greedy for the first two papers.

However, the same expression returns nothing in Python:

    import re

    pattern = r"\n\n(.*?)Show Abstract - Show Citation"
    re.findall(pattern, titles)  # titles is the text above
    # output is []

    pattern_only_one_line = r"\nShow Abstract - Show Citation"
    re.findall(pattern_only_one_line, titles)
    # output shows three lines

Could this be another problem with raw strings?

Upvotes: 1

Views: 104

Answers (1)

Wolph

Reputation: 80021

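The re.DOTALL flag is missing. Without it, the dot (.) won't match newlines, so (.*?) can never span the several lines between the blank line and "Show Abstract - Show Citation". This is not a raw-string problem.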
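To make the effect of the flag concrete, here is a minimal sketch that runs the question's pattern with and without re.DOTALL. The `sample` string is a hypothetical, shortened stand-in for the real `titles` text, not the actual data.

```python
import re

# Hypothetical, shortened stand-in for the `titles` string from the question.
sample = (
    "Plenary Papers\n"
    "\n"
    "One Some title\n"
    "John W. Doe\n"
    "Show Abstract - Show Citation\n"
)

pattern = r"\n\n(.*?)Show Abstract - Show Citation"

# Without the flag, "." stops at every newline, so nothing matches.
without_flag = re.findall(pattern, sample)

# With re.DOTALL, "." also matches "\n" and the group can span lines.
with_flag = re.findall(pattern, sample, re.DOTALL)

# The inline form (?s) at the start of the pattern is equivalent.
with_inline_flag = re.findall(r"(?s)" + pattern, sample)

print(without_flag)
print(with_flag)
print(with_inline_flag)
```

The inline (?s) form can be handy when the pattern is pasted into a tool where flags cannot be passed separately.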

But we can do better (depending on what you need exactly of course): https://regex101.com/r/iN6pX6/199

import re
import pprint

titles = '''
[Omitted for brevity]
..
'''

pattern = r'''
(?P<title>[^\n]+)\n
(?P<subtitle>[^\n]+)\n
((?P<etc>[^\n].*?)\n\n|\n)
'''

# Make sure we don't have any extraneous whitespace but add the separator
titles = titles.strip() + '\n\n'

for match in re.finditer(pattern, titles, re.DOTALL | re.VERBOSE):
    title = match.group('title')
    subtitle = match.group('subtitle')
    etc = match.group('etc')
    print('## %r' % title)
    print('# %r' % subtitle)
    if etc:
        print(etc)
    print()
    # pprint.pprint(match.groupdict())
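If you would rather stay close to the original pattern, the over-long first matches mentioned in the question probably come from the lazy group starting at the first blank line it finds, which sits above the index lines. A possible sketch, assuming every record consists only of non-empty lines, is to allow just non-empty lines inside the group. The `sample` string below is a hypothetical, shortened stand-in for `titles`.

```python
import re

# Hypothetical, shortened stand-in for the `titles` text in the question.
sample = (
    "First Author Index\n"
    "\n"
    "Harrer\n"
    "Plenary Papers\n"
    "\n"
    "One Some title\n"
    "John W. Doe\n"
    "Show Abstract - Show Citation\n"
    "\n"
    "Two some title\n"
    "Alex White\n"
    "Show Abstract - Show Citation\n"
)

END = r"Show Abstract - Show Citation"

# With DOTALL, the lazy group still begins at the first blank line,
# so the first match drags the index lines along.
loose = re.findall(r"\n\n(.*?)" + END, sample, re.DOTALL)

# Only non-empty lines are allowed between the blank line and END,
# so a match cannot cross a second blank line. No DOTALL needed.
strict = re.findall(r"\n\n((?:[^\n]+\n)+?)" + END, sample)

print(loose)
print(strict)
```

This variant keeps the "Full Text" line and similar lines inside the captured block on the real data, so some post-processing with splitlines() may still be needed.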

Upvotes: 1
