Regex to match and clean quotes in python

Question

I have a bunch of quotes scraped from Goodreads stored in a bs4.element.ResultSet, with each element of type bs4.element.Tag. I'm trying to use regex with the re module in Python 3.6.3 to clean the quotes and get just the text. When I iterate and print using [print(q.text) for q in quotes], some quotes look like this:

“Don't cry because it's over, smile because it happened.”

―

while others look like this:

“If you want to know what a man's like, take a good look at how he treats his inferiors, not his equals.”

―

,

Each also has some extra blank lines at the end. My thought was I could iterate through quotes and call re.match on each quote as follows:

cleaned_quotes = []    
for q in quote:
    match = re.match(r'“[A-Z].+$”', str(q))
    cleaned_quotes.append(match.group())

I'm guessing my regex pattern didn't match anything because I'm getting the following error:

AttributeError: 'NoneType' object has no attribute 'group'

Not surprisingly, printing the list gives me a list of None objects. Any ideas on what I might be doing wrong?

wp78de · Accepted Answer

Since you asked for this for learning purposes, here's the regex answer:

(?<=“)[\s\S]+?(?=”)

Explanation:

We use a positive lookbehind and a positive lookahead to mark the beginning and end of the match while keeping the quotation marks themselves out of the result. Between the quotes we lazily match any character, including newlines, with [\s\S]+?
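To see the lookarounds at work, here is a minimal sketch. The sample string is made up to resemble what q.text prints in the question, and the capture-group variant is only there for comparison, not part of the answer's pattern.

```python
import re

# Hypothetical sample resembling the printed text from the question
sample = "\n“Don't cry because it's over, smile because it happened.”\n\n―\n\n"

# The lookbehind and lookahead assert the curly quotes without consuming them
lookaround = re.search(r"(?<=“)[\s\S]+?(?=”)", sample)
inner = lookaround.group() if lookaround else None

# Equivalent result with a capture group: the quotes are matched but not captured
grouped = re.search(r"“([\s\S]+?)”", sample)
captured = grouped.group(1) if grouped else None

print(inner)
print(inner == captured)
```

Both approaches return the text without the surrounding quotation marks; the lookaround version simply lets you use m.group() directly.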

Sample Code:

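import re

# Lazily match everything between the opening and closing curly quotes
regex = r"(?<=“)[\s\S]+?(?=”)"
cleaned_quotes = []
for q in quotes:  # quotes is the bs4 ResultSet from the question
    m = re.search(regex, str(q))
    if m:  # skip elements that contain no quoted passage
        cleaned_quotes.append(m.group())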

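Arguably, we do not need any regex flags here. Python's re module has no g (global) flag; to get multiple matches, use re.findall() or re.finditer() instead. The m (re.MULTILINE) flag changes the behavior of the positional anchors ^ and $ so that they match at the start and end of each line instead of only at the start and end of the string. It does not make the dot match newlines; for line-spanning results you need re.DOTALL, or [\s\S] in place of the dot. Either way, putting $ in the middle of the pattern, right before the closing quote as in your attempt, is just wrong: the closing quote can never match directly after the end of the string or line.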
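A small sketch of how these options behave in Python's re module. The text variable is a made-up two-quote sample, with the first quote spanning two lines:

```python
import re

# Hypothetical sample: two quotes, the first one contains a newline
text = "“First\nquote.”\n―\n“Second quote.”\n"

# re.findall returns every non-overlapping match; no "global" flag is needed
with_class = re.findall(r"(?<=“)[\s\S]+?(?=”)", text)

# By default the dot does not match a newline, so the first quote is missed
dot_default = re.findall(r"(?<=“).+?(?=”)", text)

# With re.DOTALL the dot also matches newlines, like [\s\S]
dot_all = re.findall(r"(?<=“).+?(?=”)", text, flags=re.DOTALL)

# re.MULTILINE only affects ^ and $: here ^ matches at each line start
line_starts = re.findall(r"^“", text, flags=re.MULTILINE)
string_start = re.findall(r"^“", text)

print(with_class)
print(dot_default)
print(dot_all)
print(len(line_starts), len(string_start))
```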

One more thing, I use re.search() since re.match() matches only from the beginning of the string. A common gotcha. See the documentation.
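To illustrate the gotcha, here is a minimal sketch. The html string is a made-up stand-in for what str(q) might return, not the actual Goodreads markup:

```python
import re

# Hypothetical markup; the real Goodreads HTML will differ
html = '<div class="quoteText">\n“Some quote.”\n</div>'
pattern = r"“[\s\S]+?”"

# re.match only tries at position 0, where the string starts with "<div"
matched = re.match(pattern, html)

# re.search scans forward until the pattern matches somewhere in the string
found = re.search(pattern, html)

print(matched)
print(found.group() if found else None)
```

This is why the original attempt returned None even before the misplaced $ came into play: str(q) begins with a tag, not with the opening quotation mark.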
