I am trying to get the hottest wallpaper from Reddit's wallpaper subreddit. I am using Beautiful Soup to get the HTML layout of the first wallpaper, and then regex to get the URL from the anchor tag. But more often than not I get a URL that my regex doesn't match. Here's the code I am using:
import re
import requests
from bs4 import BeautifulSoup

r = requests.get("https://www.reddit.com/r/wallpapers")
if r.status_code == 200:
    print(r.status_code)
    text = r.text
    soup = BeautifulSoup(text, "html.parser")
    search_string = str(soup.find('a', {'class': 'title'}))
    photo_url = str(re.search('[htps:/]{7,8}[a-zA-Z0-9._/:.]+[a-zA-Z0-9./:.-]+', search_string).group())
Is there any way around it?
Upvotes: 4
Views: 348
Reputation: 1133
Here's a better way to do it:
Adding .json to the end of a URL on Reddit returns a JSON object instead of HTML. For example, https://www.reddit.com/r/wallpapers will give you HTML content, but https://www.reddit.com/r/wallpapers/.json will give you a JSON object, which you can easily work with using the json module in Python.
Here's the same program for getting the hottest wallpaper:
>>> import urllib
>>> import json
>>> data = urllib.urlopen('https://www.reddit.com/r/wallpapers/.json')
>>> wallpaper_dict = json.loads(data.read())
>>> wallpaper_dict['data']['children'][1]['data']['url']
u'http://i.imgur.com/C49VtMu.jpg'
>>> wallpaper_dict['data']['children'][1]['data']['title']
u'Space Shuttle'
>>> wallpaper_dict['data']['children'][1]['data']['domain']
u'i.imgur.com'
Not only is it much cleaner, it'll also save you a headache if Reddit changes its HTML layout or someone posts a URL that your regex can't handle.
As a rule of thumb, it's generally smarter to use the JSON instead of scraping the HTML.
PS: The index into ['children'] is the wallpaper's position in the listing: the first element is the topmost post, the next is the one below it, and so on. Therefore ['data']['children'][2]['data']['url'] will give you the link for the second hottest wallpaper. You get the gist? :)
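The indexing described above can be illustrated offline on a minimal stand-in for the JSON structure (the field names match the real response shape shown in the transcript; the URLs and titles below are made up for the example):

```python
import json

# A made-up fragment mimicking the shape of /r/wallpapers/.json
sample = json.loads("""
{"data": {"children": [
    {"data": {"url": "http://example.com/sticky.jpg",  "title": "Stickied post"}},
    {"data": {"url": "http://example.com/first.jpg",  "title": "Hottest"}},
    {"data": {"url": "http://example.com/second.jpg", "title": "Second hottest"}}
]}}
""")

# children[1] is the hottest post, children[2] the second hottest, and so on
urls = [child['data']['url'] for child in sample['data']['children'][1:]]
print(urls)
```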
PPS: What's more, with this method you can use the default urllib module. Generally, when you're scraping Reddit you'd have to create a fake User-Agent header and pass it along with the request (otherwise it gives you a 429 response code), but that's not the case with this method.
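If you do need to attach a fake User-Agent, here's a sketch of building (but not sending) such a request with Python 3's urllib.request, the successor of the Python 2 urllib used above; the header value is an arbitrary placeholder:

```python
import urllib.request

# Attach a custom User-Agent header; urllib normalizes the
# header name to "User-agent" internally.
req = urllib.request.Request(
    "https://www.reddit.com/r/wallpapers/.json",
    headers={"User-Agent": "my-wallpaper-script/0.1"},
)
print(req.get_header("User-agent"))
# The request would then be sent with urllib.request.urlopen(req)
```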
Upvotes: 5
Reputation: 8786
Here is the correct way to do it with your method, but Jarwin's method is better. You should not be using regex when working with HTML. You simply had to reference the href attribute to get the URL:
import requests
from bs4 import BeautifulSoup

r = requests.get("https://www.reddit.com/r/wallpapers")
if r.status_code == 200:
    soup = BeautifulSoup(r.text, "html.parser")
    url = str(soup.find_all('a', {'class': 'title'})[1]["href"])
    print(url)
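The href lookup can be checked against a static snippet without hitting Reddit at all (the markup below is a made-up stand-in for the old listing layout, with a stickied link first, as the [1] index assumes):

```python
from bs4 import BeautifulSoup

# Made-up HTML mimicking Reddit's old listing markup
html = '''
<a class="title" href="http://example.com/sticky.jpg">Stickied post</a>
<a class="title" href="http://i.example.com/C49VtMu.jpg">Space Shuttle</a>
'''
soup = BeautifulSoup(html, "html.parser")
# find_all(...)[1] skips the stickied first link and takes the hottest post
url = soup.find_all('a', {'class': 'title'})[1]["href"]
print(url)
```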
Upvotes: 1