Reputation: 43
I've written a regular expression and tested it on regex101.com, yet when I use it in my code it returns no values and I have no idea why.
I'm scraping an HTML document (an RSS feed, specifically) and have other regexes working against the same document within the same program, just not this particular one! I'm at a loss, since it works on regex101.com (and in another Python program I have access to, which was developed specifically for testing regexes). I need to scrape the title of each article, the description, and the date/time it was posted. Titles and date/time work (example of the title regex working below), but I cannot get the description (variable 'snippets') to print.
What I have tried:
# There's a 'download' function earlier on which downloads the RSS page to a file
text_in = download(url='https://www.theverge.com/rss/index.xml', target_filename='downloadtheverge')
text_in = open('downloadtheverge.xhtml', 'r', encoding="utf8").read()
snippetresults = sorted(set(findall(r'<p\sid=\"[A-Za-z0-9]*\">([A-Za-z0-9\s\-\—\:\/\,\’\'\‘\?\!\.]*\s?)<\/p>', text_in)))
for snippets in snippetresults:
    print(snippets)
An example of what is being searched:
<p id="BjKuOh">Only a single key change isn’t being reversed: YouTube will actually verify that channels are authentic, whereas in the past it seemingly has not thoroughly taken this very obvious step.</p>
What is returned from the regex on regex101.com:
'Only a single key change isn’t being reversed: YouTube will actually verify that channels are authentic, whereas in the past it seemingly has not thoroughly taken this very obvious step.'
What does work:
titlesresults = sorted(set(findall(r'<title>([A-Za-z0-9\s\-\—\:\/\,\’\'\‘\?\!\.]+\s?)<\/title>', text_in)))
for titles in titlesresults:
    print(titles)
It has the same format and prints the titles in the HTML document to the shell window, like this: 'Beats headphones will get the same iOS 13.1 audio sharing feature as AirPods Don’t update to iOS 13.0 if you play Fortnite or PUBG Mobile' etc.
Yet when I run the 'snippets' version in my program, the shell window shows nothing. Any help would be greatly appreciated!
Upvotes: 1
Views: 88
Reputation: 31329
This doesn't work:
from re import findall
from urllib import request
text_in = request.urlopen(url='https://www.theverge.com/rss/index.xml').read().decode()
snippetresults = sorted(set(findall(r'<p\sid=\"[A-Za-z0-9]*\">([A-Za-z0-9\s\-\—\:\/\,\’\'\‘\?\!\.]*\s?)<\/p>', text_in)))
for snippets in snippetresults:
    print(snippets)
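One way to see why the pattern finds nothing is to compare how a paragraph looks once rendered (which is what tends to get pasted into regex101) with how it may look in the raw feed text. This is a minimal sketch, assuming the feed stores article HTML in escaped form; both sample strings are hand-written for illustration and are not copied from the live feed.

```python
import re

# The pattern from the question, unchanged.
PATTERN = r'<p\sid=\"[A-Za-z0-9]*\">([A-Za-z0-9\s\-\—\:\/\,\’\'\‘\?\!\.]*\s?)<\/p>'

# Hand-written samples (assumptions, not real feed data):
# the paragraph as it appears after rendering...
rendered = '<p id="BjKuOh">Only a single key change isn’t being reversed.</p>'
# ...and the same paragraph with its angle brackets escaped as entities.
raw_feed = '&lt;p id="BjKuOh"&gt;Only a single key change isn’t being reversed.&lt;/p&gt;'

matches_rendered = re.findall(PATTERN, rendered)
matches_raw = re.findall(PATTERN, raw_feed)

print(matches_rendered)  # one match: the paragraph text
print(matches_raw)       # empty list: there is no literal '<p' to match
```

Printing a slice of `text_in` around a known phrase is a quick way to check which of the two forms your program is actually reading.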
But this does (note the HTML entities `&lt;` and `&gt;` in place of `<` and `>`):
from re import findall
from urllib import request
text_in = request.urlopen(url='https://www.theverge.com/rss/index.xml').read().decode()
snippetresults = sorted(set(findall(r'&lt;p\sid=\"[A-Za-z0-9]*\"&gt;([A-Za-z0-9\s\-\—\:\/\,\’\'\‘\?\!\.]*\s?)&lt;\/p&gt;', text_in)))
for snippets in snippetresults:
    print(snippets)
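An alternative that keeps the original pattern untouched is to decode the entities first and search afterwards. The sketch below uses `html.unescape` from the standard library. The helper name `find_snippets` and the sample string are hypothetical, made up for illustration, and it assumes the paragraphs are escaped in the feed text.

```python
import html
import re

# The pattern from the question, unchanged.
PATTERN = r'<p\sid=\"[A-Za-z0-9]*\">([A-Za-z0-9\s\-\—\:\/\,\’\'\‘\?\!\.]*\s?)<\/p>'


def find_snippets(feed_text):
    """Decode HTML entities, then apply the question's pattern.

    Hypothetical helper for illustration only.
    """
    decoded = html.unescape(feed_text)  # turns &lt; and &gt; back into < and >
    return sorted(set(re.findall(PATTERN, decoded)))


# Hand-written stand-in for the escaped feed content (an assumption).
sample = (
    '&lt;p id="BjKuOh"&gt;Only a single key change isn’t being reversed.&lt;/p&gt;'
    '&lt;p id="a1B2c3"&gt;A second paragraph.&lt;/p&gt;'
)

results = find_snippets(sample)
for snippet in results:
    print(snippet)
```

Decoding first also turns numeric entities such as `&#8217;` back into the characters the pattern already allows, so the character class does not need to change.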
Upvotes: 1