kratze

Reputation: 185

BeautifulSoup - which URL

I started a small project. I am trying to scrape http://pr0gramm.com/ and save the tags shown under a picture in a variable, but I can't get it to work.

I am looking for this element in the page source:

<a class="tag-link" href="/top/Flaschenkind">Flaschenkind</a>

I actually only need the text "Flaschenkind", along with the other tags on that line.

This is my code so far:

import requests
from bs4 import BeautifulSoup

url = "http://pr0gramm.com/"
r = requests.get(url)

soup = BeautifulSoup(r.content, "lxml")

links = soup.find_all("div", {"class" : "item-tags"})

print(links)

Sadly, this is all the output I get:

[]

I already tried changing the URL to http://pr0gramm.com/top/, but I get the same output. Could it be that the site builds its content with JavaScript, so the data is not in the HTML that requests downloads?

Upvotes: 0

Views: 219

Answers (2)

JimmyNJ

Reputation: 1194

First off, the URL you are using is the JavaScript-driven version of the site. A static version is available at www.pr0gramm.com/static/, where the content is formatted more like your example suggests you expect.

Using this static URL, I retrieved the <a> tags with the code below, which is similar to yours except that I removed the class filter. It is written for Python 2.7.

import bs4
import urllib2

def main():

    url = "http://pr0gramm.com/static/"
    try:
        fin = urllib2.urlopen(url)
    except urllib2.URLError as exc:
        # URLError also covers HTTPError (e.g. 404 or 500 responses)
        print "URL retrieval failed, url:", url, exc
        return None

    html = fin.read()

    bs = bs4.BeautifulSoup(html,"html5lib")

    links = bs.find_all("a")
    print links
    return None


if __name__ == "__main__":
    main()
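The snippet above returns every <a> element, while the question only needs the text inside the links with class "tag-link". Here is a minimal Python 3 sketch of that filtering step. It assumes the markup looks like the line quoted in the question; the sample HTML, the `TagLinkParser` class and the `tag_names` variable are made up for illustration, and it uses only the standard-library `html.parser` module rather than BeautifulSoup so that it runs without extra packages.

```python
from html.parser import HTMLParser


class TagLinkParser(HTMLParser):
    """Collect the text of every <a> element whose class list contains "tag-link"."""

    def __init__(self):
        super().__init__()
        self.tags = []              # one string per matching link
        self._in_tag_link = False   # True while inside a matching <a>

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "a" and "tag-link" in classes:
            self._in_tag_link = True
            self.tags.append("")

    def handle_data(self, data):
        if self._in_tag_link:
            self.tags[-1] += data

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_tag_link = False


# Hypothetical sample markup, modelled on the element quoted in the question.
sample_html = """
<div class="item-tags">
  <a class="tag-link" href="/top/Flaschenkind">Flaschenkind</a>
  <a class="tag-link" href="/top/webm">webm</a>
  <a class="other" href="/about">About</a>
</div>
"""

parser = TagLinkParser()
parser.feed(sample_html)
tag_names = [name.strip() for name in parser.tags]
print(tag_names)
```

With BeautifulSoup the same filter should be expressible as `[a.get_text() for a in soup.select("a.tag-link")]`, provided the downloaded HTML actually contains those links.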

Upvotes: 0

alecxe

Reputation: 473903

The problem is that this is a dynamic site: all of the data you see is loaded via additional XHR calls to the website's JSON API. You need to simulate those calls in your code.

Working example using requests:

from urllib.parse import urljoin

import requests

base_image_url = "http://img.pr0gramm.com"
with requests.Session() as session:
    response = session.get("http://pr0gramm.com/api/items/get", params={"flags": 1, "promoted": "1"})

    posts = response.json()["items"]
    for post in posts:
        image_url = urljoin(base_image_url, post["image"])

        # get tags
        response = session.get("http://pr0gramm.com/api/items/info", params={"itemId": post["id"]})
        post_data = response.json()
        tags = [tag["tag"] for tag in post_data["tags"]]

        print(image_url, tags)

This prints each post's image URL along with a list of its tags:

http://img.pr0gramm.com/2016/03/07/f693234d558334d7.jpg ['Datsun 1600 Wagon', 'Garage 88', 'Kombi', 'nur Oma liegt tiefer', 'rolladen', 'slow']
http://img.pr0gramm.com/2016/03/07/185544cda956679e.webm ['Danke Merkel', 'deeskalierte zeitnah', 'demokratie im endstadium', 'Fachkraft', 'Far Cry Primal', 'Invite is raus', 'typ ist nackt', 'VVS', 'webm', 'zeigt seine stange']
http://img.pr0gramm.com/2016/03/07/4a6719b33219fd87.jpg ['bmw', 'der Gerät', 'Drehmoment', 'für mehr Motorräder auf pr0', 'Motorrad']
...
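To try the parsing logic without hitting the network, the two transformations from the loop above can be pulled into small functions and run on a hand-written payload. This is only a sketch: `extract_tags`, `build_image_url` and the sample dictionaries are hypothetical, and the only things taken from the example above are the "tags" list of objects with a "tag" key, the "image" field and the image base URL. The real API responses may contain other fields or change shape.

```python
from urllib.parse import urljoin

BASE_IMAGE_URL = "http://img.pr0gramm.com"


def extract_tags(post_data):
    """Return the tag names from an items/info-style payload.

    Entries without a "tag" key are skipped instead of raising KeyError.
    """
    return [entry["tag"] for entry in post_data.get("tags", []) if "tag" in entry]


def build_image_url(post):
    """Join the relative "image" path of a post onto the image host."""
    return urljoin(BASE_IMAGE_URL, post["image"])


# Hypothetical sample data shaped like the responses used in the example above.
sample_post = {"id": 1, "image": "2016/03/07/f693234d558334d7.jpg"}
sample_post_data = {"tags": [{"tag": "Kombi"}, {"tag": "slow"}, {"note": "entry without a tag"}]}

image_url = build_image_url(sample_post)
tags = extract_tags(sample_post_data)
print(image_url, tags)
```

Keeping the parsing separate from the HTTP calls also makes it easier to see which part broke if the API changes.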

Upvotes: 1
