user5730426

Reputation:

Getting the top wallpaper from reddit

I am trying to get the hottest wallpaper from Reddit's wallpaper subreddit. I am using Beautiful Soup to get the HTML of the first wallpaper's post, and then a regex to extract the URL from the anchor tag. But more often than not I get a URL that my regex doesn't match. Here's the code I am using:

import re
import requests
from bs4 import BeautifulSoup

r = requests.get("https://www.reddit.com/r/wallpapers")
if r.status_code == 200:
    print(r.status_code)
    text = r.text
    soup = BeautifulSoup(text, "html.parser")
    search_string = str(soup.find('a', {'class': 'title'}))
    photo_url = re.search('[htps:/]{7,8}[a-zA-Z0-9._/:.]+[a-zA-Z0-9./:.-]+', search_string).group()

Is there any way around it?

Upvotes: 4

Views: 348

Answers (2)

Jarwin

Reputation: 1133

Here's a better way to do it: appending .json to a Reddit URL returns a JSON object instead of HTML.
For example, https://www.reddit.com/r/wallpapers serves HTML content, but
https://www.reddit.com/r/wallpapers/.json gives you a JSON object that you can easily work with using the json module in Python.

Here's the same program for getting the hottest wallpaper:

>>> import urllib
>>> import json

>>> data = urllib.urlopen('https://www.reddit.com/r/wallpapers/.json')
>>> wallpaper_dict = json.loads(data.read())

>>> wallpaper_dict['data']['children'][1]['data']['url']
u'http://i.imgur.com/C49VtMu.jpg'

>>> wallpaper_dict['data']['children'][1]['data']['title']
u'Space Shuttle'

>>> wallpaper_dict['data']['children'][1]['data']['domain']
u'i.imgur.com'

Not only is it much cleaner, it will also save you a headache if Reddit changes its HTML layout or someone posts a URL that your regex can't handle.
As a rule of thumb, it's generally smarter to use JSON instead of scraping the HTML.

PS: The index into ['children'] selects the wallpaper. The list is zero-based, and the first entry (index 0) is usually a stickied post, which is why index 1 gives the hottest wallpaper. Therefore ['data']['children'][2]['data']['url'] will give you the link for the second-hottest wallpaper. You get the gist? :)
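To make that indexing concrete, here is a small sketch. The nested dict below is a hand-made stand-in for Reddit's real listing JSON, trimmed to just the fields used above (the first and third entries are invented for illustration; the second matches the session output above):

```python
import json

# Hand-made sample mimicking the shape of Reddit's listing JSON;
# the real response carries many more fields per post.
sample = json.loads("""
{
  "data": {
    "children": [
      {"data": {"title": "Sticky: subreddit rules", "url": "https://www.reddit.com/r/wallpapers/", "domain": "self.wallpapers"}},
      {"data": {"title": "Space Shuttle", "url": "http://i.imgur.com/C49VtMu.jpg", "domain": "i.imgur.com"}},
      {"data": {"title": "Mountain Lake", "url": "http://i.imgur.com/example2.jpg", "domain": "i.imgur.com"}}
    ]
  }
}
""")

# children[1] is the hottest wallpaper; children[0] is the stickied post
hottest = sample['data']['children'][1]['data']
print(hottest['url'])    # http://i.imgur.com/C49VtMu.jpg
print(hottest['title'])  # Space Shuttle
```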

PPS: What's more, with this method you can use the default urllib module. Normally when you're scraping Reddit you have to create a fake User-Agent header and pass it with the request (otherwise it gives you a 429 response code), but that's not the case with this method.
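For reference, if you ever do need that header (e.g. for other endpoints), here is a minimal Python 3 sketch of attaching a User-Agent with the standard library; the agent string is just a made-up example, and no network call is made:

```python
import urllib.request

# Build a request with an explicit User-Agent header; Reddit tends to
# answer 429 to the default Python-urllib agent string.
url = 'https://www.reddit.com/r/wallpapers/.json'
req = urllib.request.Request(url, headers={'User-Agent': 'my-wallpaper-bot/0.1'})

# urllib normalizes header names to Capitalized form internally
print(req.get_header('User-agent'))  # my-wallpaper-bot/0.1
# urllib.request.urlopen(req) would then fetch the JSON
```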

Upvotes: 5

heinst

Reputation: 8786

Here is the correct way to do it with your method, though Jarwin's method is better. You should not be using regex when working with HTML. You simply had to read the anchor's href attribute to get the URL:

import requests
from bs4 import BeautifulSoup

r = requests.get("https://www.reddit.com/r/wallpapers")
if r.status_code == 200:
    soup = BeautifulSoup(r.text, "html.parser")
    # index 1 skips the stickied post at the top of the listing
    url = soup.find_all('a', {'class': 'title'})[1]["href"]
    print(url)

Upvotes: 1
