Reputation: 133
I have a bunch of quotes scraped from Goodreads stored in a bs4.element.ResultSet, with each element of type bs4.element.Tag. I'm trying to use regex with the re module in Python 3.6.3 to clean the quotes and get just the text. When I iterate and print using [print(q.text) for q in quotes], some quotes look like this:
“Don't cry because it's over, smile because it happened.”
―
while others look like this:
“If you want to know what a man's like, take a good look at how he treats his inferiors, not his equals.”
―
,
Each also has some extra blank lines at the end. My thought was I could iterate through quotes and call re.match on each quote as follows:
cleaned_quotes = []
for q in quotes:
    match = re.match(r'“[A-Z].+$”', str(q))
    cleaned_quotes.append(match.group())
I'm guessing my regex pattern didn't match anything because I'm getting the following error:
AttributeError: 'NoneType' object has no attribute 'group'
Not surprisingly, printing the list gives me a list of None objects. Any ideas on what I might be doing wrong?
Upvotes: 1
Views: 508
Reputation: 18980
Since you asked for this for learning purposes, here's the regex answer:
(?<=“)[\s\S]+?(?=”)
Explanation:
We use a positive lookbehind and a positive lookahead to mark the beginning and end of the pattern, which also keeps the quotation marks out of the result.
Inside the quotes we lazily match anything with [\s\S]+?, which, unlike .+?, also matches newlines.
Sample Code:
import re
regex = r"(?<=“)[\s\S]+?(?=”)"
cleaned_quotes = []
for q in quotes:
    m = re.search(regex, str(q))
    if m:
        cleaned_quotes.append(m.group())
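To try the pattern without scraping anything, here is a minimal sketch that runs it on plain strings. The `samples` list is a made-up stand-in for the scraped text (the second entry is shortened and given a line break on purpose), and `QUOTE_RE` is just an illustrative name.

```python
import re

# Hypothetical sample strings, shaped roughly like the output in the question.
samples = [
    "\n      “Don't cry because it's over, smile because it happened.”\n  ―\n\n",
    "\n      “If you want to know what a man's like,\ntake a good look.”\n  ―\n  ,\n\n",
]

# Lookbehind/lookahead keep the curly quotes out of the match;
# [\s\S]+? lazily matches any character, including newlines.
QUOTE_RE = re.compile(r"(?<=“)[\s\S]+?(?=”)")

cleaned_quotes = []
for s in samples:
    m = QUOTE_RE.search(s)
    if m:  # skip strings without quoted text instead of raising AttributeError
        cleaned_quotes.append(m.group())

for quote_text in cleaned_quotes:
    print(repr(quote_text))
```

The `if m:` guard is what avoids the AttributeError from the question when a string contains no quoted text.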
Arguably, we do not need any regex flags here. Python's re module has no g (global) flag; to get multiple matches, use re.findall() or re.finditer() instead. The m flag (re.MULTILINE) makes the positional anchors ^ and $ match at the start and end of each line instead of only at the start and end of the string. It does not make the dot match newlines, so for matches that span lines you still need [\s\S] instead of the dot (or the re.DOTALL flag). Either way, placing a positional anchor such as $ in the middle of the pattern, before the closing quote, is just wrong.
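The flag behavior can be checked with a short sketch. The strings `text` and `spanning` are invented for illustration; only standard re functions and flags are used.

```python
import re

text = "“first”\n“second”"

# Python has no g flag: re.findall returns every non-overlapping match.
all_quotes = re.findall(r"(?<=“)[\s\S]+?(?=”)", text)

# Without re.MULTILINE, $ matches only at the end of the string
# (or just before a single trailing newline).
end_only = re.findall(r"”$", text)

# With re.MULTILINE, $ also matches at the end of every line.
each_line = re.findall(r"”$", text, re.MULTILINE)

# The dot does not match a newline unless re.DOTALL is given.
spanning = "“two\nlines”"
dot_match = re.search(r"“.+”", spanning)
dotall_match = re.search(r"“.+”", spanning, re.DOTALL)

print(all_quotes, len(end_only), len(each_line), dot_match, dotall_match.group())
```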
One more thing: I use re.search() since re.match() matches only from the beginning of the string. A common gotcha. See the documentation.
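A small sketch of that gotcha, assuming str(q) yields markup in which the quote is not the very first character. The string `s` and its class name are hypothetical, not taken from the actual Goodreads page.

```python
import re

# Hypothetical stand-in for str(q): the quote is preceded by markup and whitespace.
s = '<div class="quoteText">\n  “Don\'t cry because it\'s over.”\n</div>'

pattern = r"“[\s\S]+?”"

matched = re.match(pattern, s)    # only tries at position 0, so this is None
searched = re.search(pattern, s)  # scans the whole string for the first match

print(matched)
print(searched.group())
```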
Upvotes: 3
Reputation: 4668
First of all, in your expression r'“[A-Z].+$”' the end-of-line anchor $ is placed before the closing ”, which is logically not possible: nothing can follow the end of a line. To use $ in regexes for multiline strings, you should also specify the re.MULTILINE flag.
Second, re.match only matches at the beginning of the string; it does not find a part of the string that matches the regular expression. That means re.search should do what you initially expected to accomplish.
So the resulting regex could be:
re.search(r'“[A-Z].+”$', str(q), re.MULTILINE)
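As a rough check of the anchor placement, the sketch below compares the original pattern with the corrected one. The string `s` is invented, and it assumes the closing ” is the last character on its line, which may not hold for the real markup.

```python
import re

# Hypothetical text: the quote sits on its own line, followed by more lines.
s = "  intro\n“Don't cry because it's over.”\n  ―\n"

# $ before the closing quote: nothing can follow the end of a line, so no match.
bad = re.search(r"“[A-Z].+$”", s, re.MULTILINE)

# $ after the closing quote, with re.MULTILINE so it matches at a line end.
good = re.search(r"“[A-Z].+”$", s, re.MULTILINE)

# Same pattern without the flag: $ only matches at the end of the whole string.
no_flag = re.search(r"“[A-Z].+”$", s)

print(bad, good.group(), no_flag)
```

Because `.+` does not cross newlines, this version only finds quotes that fit on a single line.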
Upvotes: 0