I am trying to get the hottest wallpaper from Reddit's wallpaper subreddit. I am using Beautiful Soup to get the HTML layout of the first wallpaper, and then regex to get the URL from the anchor tag. But more often than not I get a URL that my regex doesn't match. Here's the code I am using:
import re
import requests
from bs4 import BeautifulSoup

r = requests.get("https://www.reddit.com/r/wallpapers")
if r.status_code == 200:
    print(r.status_code)
    text = r.text
    soup = BeautifulSoup(text, "html.parser")
    search_string = str(soup.find('a', {'class': 'title'}))
    photo_url = str(re.search('[htps:/]{7,8}[a-zA-Z0-9._/:.]+[a-zA-Z0-9./:.-]+', search_string).group())
Is there any way around it?
Upvotes: 4
Views: 348
Reputation: 1133
Here's a better way to do it:
Adding .json to the end of a URL on Reddit returns a JSON object instead of HTML. For example, https://www.reddit.com/r/wallpapers will give you HTML content, but https://www.reddit.com/r/wallpapers/.json will give you a JSON object, which you can easily work with using the json module in Python.
Here's the same program for getting the hottest wallpaper:
>>> import urllib
>>> import json
>>> data = urllib.urlopen('https://www.reddit.com/r/wallpapers/.json')
>>> wallpaper_dict = json.loads(data.read())
>>> wallpaper_dict['data']['children'][1]['data']['url']
u'http://i.imgur.com/C49VtMu.jpg'
>>> wallpaper_dict['data']['children'][1]['data']['title']
u'Space Shuttle'
>>> wallpaper_dict['data']['children'][1]['data']['domain']
u'i.imgur.com'
Not only is it much cleaner, it'll also save you a headache if Reddit changes its HTML layout or someone posts a URL that your regex can't handle.
As a rule of thumb, it's generally smarter to use the JSON instead of scraping the HTML.
PS: The index into ['children'] is the wallpaper's position in the listing: the first element is the topmost post, the next is the one below it, and so on. Therefore ['data']['children'][2]['data']['url'] will give you the link for the second hottest wallpaper. You get the gist? :)
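The indexing described above can be illustrated offline on a minimal stand-in for the JSON structure (the field names match the real response shape shown in the transcript; the URLs and titles below are made up for the example):

```python
import json

# A made-up fragment mimicking the shape of /r/wallpapers/.json
sample = json.loads("""
{"data": {"children": [
    {"data": {"url": "http://example.com/sticky.jpg",  "title": "Stickied post"}},
    {"data": {"url": "http://example.com/first.jpg",  "title": "Hottest"}},
    {"data": {"url": "http://example.com/second.jpg", "title": "Second hottest"}}
]}}
""")

# children[1] is the hottest post, children[2] the second hottest, and so on
urls = [child['data']['url'] for child in sample['data']['children'][1:]]
print(urls)
```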
PPS: What's more, with this method you can use the default urllib module. Generally, when you're scraping Reddit you'd have to create a fake User-Agent header and pass it along with the request (otherwise it gives you a 429 response code), but that's not the case with this method.
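If you do need to attach a fake User-Agent, here's a sketch of building (but not sending) such a request with Python 3's urllib.request, the successor of the Python 2 urllib used above; the header value is an arbitrary placeholder:

```python
import urllib.request

# Attach a custom User-Agent header; urllib normalizes the
# header name to "User-agent" internally.
req = urllib.request.Request(
    "https://www.reddit.com/r/wallpapers/.json",
    headers={"User-Agent": "my-wallpaper-script/0.1"},
)
print(req.get_header("User-agent"))
# The request would then be sent with urllib.request.urlopen(req)
```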
Upvotes: 5
Reputation: 8786
Here is the correct way to do it with your method, but Jarwin's method is better. You should not be using regex when working with HTML. You simply had to reference the href attribute to get the URL:
import requests
from bs4 import BeautifulSoup

r = requests.get("https://www.reddit.com/r/wallpapers")
if r.status_code == 200:
    soup = BeautifulSoup(r.text, "html.parser")
    url = str(soup.find_all('a', {'class': 'title'})[1]["href"])
    print(url)
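The href lookup can be checked against a static snippet without hitting Reddit at all (the markup below is a made-up stand-in for the old listing layout, with a stickied link first, as the [1] index assumes):

```python
from bs4 import BeautifulSoup

# Made-up HTML mimicking Reddit's old listing markup
html = '''
<a class="title" href="http://example.com/sticky.jpg">Stickied post</a>
<a class="title" href="http://i.example.com/C49VtMu.jpg">Space Shuttle</a>
'''
soup = BeautifulSoup(html, "html.parser")
# find_all(...)[1] skips the stickied first link and takes the hottest post
url = soup.find_all('a', {'class': 'title'})[1]["href"]
print(url)
```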
Upvotes: 1