lostnote

Reputation: 43

Regular Expression not returning results

I've written a regular expression and tested it on regex101.com, yet when I use it in my code I get no values returned, and I have no idea why.

I'm scraping an HTML document (an RSS feed, specifically) and have other regexes working against that same document within the same program, just not this particular one! I'm at a loss, since it works on regex101.com (and in another Python program I have access to, which was developed specifically for testing regexes). I need to scrape the title of each article, the description, and the date/time it was posted. Titles and date/time work (example of the title regex working below), but I cannot get the description (variable 'snippets') to print.

What I have tried:

#There's a 'download' function earlier on which downloads the RSS page to a file
text_in = download(url='https://www.theverge.com/rss/index.xml', target_filename = 'downloadtheverge')
text_in = open('downloadtheverge.xhtml', 'r', encoding="utf8").read()

snippetresults = sorted(set(findall(r'<p\sid=\"[A-Za-z0-9]*\">([A-Za-z0-9\s\-\—\:\/\,\’\'\‘\?\!\.]*\s?)<\/p>', text_in)))
for snippets in snippetresults:
    print(snippets)

An example of what is being searched:

<p id="BjKuOh">Only a single key change isn’t being reversed: YouTube will actually verify that channels are authentic, whereas in the past it seemingly has not thoroughly taken this very obvious step.</p>

What is returned from the regex on regex101.com:

'Only a single key change isn’t being reversed: YouTube will actually verify that channels are authentic, whereas in the past it seemingly has not thoroughly taken this very obvious step.'

What does work:

titlesresults = sorted(set(findall(r'<title>([A-Za-z0-9\s\-\—\:\/\,\’\'\‘\?\!\.]+\s?)<\/title>', text_in)))
for titles in titlesresults:
    print(titles)

This has the same format and prints the titles from the HTML document to the shell window, like this: 'Beats headphones will get the same iOS 13.1 audio sharing feature as AirPods Don’t update to iOS 13.0 if you play Fortnite or PUBG Mobile' and so on.

Yet when I run the 'snippets' version in my program, the shell window shows nothing. Any help would be greatly appreciated!

Upvotes: 1

Views: 88

Answers (1)

Grismar

Reputation: 31329

This doesn't work:

from re import findall
from urllib import request

text_in = request.urlopen(url='https://www.theverge.com/rss/index.xml').read().decode()

snippetresults = sorted(set(findall(r'<p\sid=\"[A-Za-z0-9]*\">([A-Za-z0-9\s\-\—\:\/\,\’\'\‘\?\!\.]*\s?)<\/p>', text_in)))
for snippets in snippetresults:
    print(snippets)

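But this does (note the HTML entities: in the feed text the paragraph tags appear escaped as &lt;p ...&gt; and &lt;/p&gt;, so a pattern written with literal < and > never matches):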

from re import findall
from urllib import request

text_in = request.urlopen(url='https://www.theverge.com/rss/index.xml').read().decode()

snippetresults = sorted(set(findall(r'&lt;p\sid=\"[A-Za-z0-9]*\"&gt;([A-Za-z0-9\s\-\—\:\/\,\’\'\‘\?\!\.]*\s?)&lt;\/p&gt;', text_in)))
for snippets in snippetresults:
    print(snippets)
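
An alternative, as a rough sketch only: rather than writing the entities into the pattern, the text could be decoded with html.unescape from the standard library first, after which the original pattern applies unchanged. The sample_feed string and the find_snippets helper below are hypothetical stand-ins for illustration; the real feed may contain other entities or markup that this does not account for.

```python
from html import unescape
from re import findall

# Hypothetical sample standing in for a fragment of the downloaded feed,
# with the paragraph tags escaped as HTML entities.
sample_feed = '&lt;p id="BjKuOh"&gt;Only a single key change isn’t being reversed.&lt;/p&gt;'

# The pattern from the question, unchanged.
pattern = r'<p\sid=\"[A-Za-z0-9]*\">([A-Za-z0-9\s\-\—\:\/\,\’\'\‘\?\!\.]*\s?)<\/p>'


def find_snippets(text):
    """Decode HTML entities, then apply the original pattern."""
    decoded = unescape(text)  # turns &lt; and &gt; back into < and >
    return sorted(set(findall(pattern, decoded)))


# The raw (still escaped) text gives no matches; the decoded text does.
print(findall(pattern, sample_feed))
print(find_snippets(sample_feed))
```

This keeps the pattern identical to the one tested on regex101.com, where the sample was presumably pasted with literal < and > characters.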

Upvotes: 1
