Reputation: 185
I started a small project. I am trying to scrape http://pr0gramm.com/ and store the tags shown under each picture in a variable, but I can't get it to work.
This is the element I am looking for in the page source:
<a class="tag-link" href="/top/Flaschenkind">Flaschenkind</a>
I only need the text "Flaschenkind" to be saved, along with the other tags that follow it on that line.
This is my code so far:
import requests
from bs4 import BeautifulSoup
url = "http://pr0gramm.com/"
r = requests.get(url)
soup = BeautifulSoup(r.content, "lxml")
links = soup.find_all("div", {"class" : "item-tags"})
print(links)
Unfortunately, I only get this output:
[]
I already tried changing the URL to http://pr0gramm.com/top/ but I get the same output. Could this be happening because the site is built with JavaScript, so the data is not in the HTML that requests receives?
Upvotes: 0
Views: 219
Reputation: 1194
First off, the URL you are using is the JavaScript-driven version of the site. They also offer a static version at www.pr0gramm.com/static/, where the content is formatted more like your example suggests you expect.
Using this static version of the URL, I retrieved the <a> tags with the code below, which is similar to yours except that I removed the class filter. It is written for Python 2.7.
import bs4
import urllib2

def main():
    url = "http://pr0gramm.com/static/"
    try:
        fin = urllib2.urlopen(url)
    except urllib2.URLError:
        # Covers connection failures and HTTP error responses.
        print "Url retrieval failed url:", url
        return None
    html = fin.read()
    bs = bs4.BeautifulSoup(html, "html5lib")
    # Collect every <a> element; no class filter is applied.
    links = bs.find_all("a")
    print links
    return None

if __name__ == "__main__":
    main()
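To go from a list of <a> elements to just the tag names, the anchors still need to be filtered and their text extracted. Here is a minimal Python 3 sketch of that step. It assumes the page uses the same <a class="tag-link"> markup shown in the question, which I have not verified for the static version. It uses only the standard library's html.parser so it runs without bs4, and the names TagLinkParser and extract_tags are made up for illustration.

```python
from html.parser import HTMLParser


class TagLinkParser(HTMLParser):
    """Collect the text of every <a class="tag-link"> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self._in_tag_link = False

    def handle_starttag(self, tag, attrs):
        # The class attribute may be missing or hold several class names.
        classes = (dict(attrs).get("class") or "").split()
        if tag == "a" and "tag-link" in classes:
            self._in_tag_link = True
            self.tags.append("")

    def handle_data(self, data):
        # Only accumulate text while inside a matching anchor.
        if self._in_tag_link:
            self.tags[-1] += data

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_tag_link = False


def extract_tags(html):
    """Return the stripped text of each tag-link anchor found in html."""
    parser = TagLinkParser()
    parser.feed(html)
    parser.close()
    return [text.strip() for text in parser.tags]


# The snippet from the question, used as sample input.
sample = '<a class="tag-link" href="/top/Flaschenkind">Flaschenkind</a>'
print(extract_tags(sample))  # ['Flaschenkind']
```

With bs4 the same idea would be a find_all on "a" with a class filter followed by get_text() on each result.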
Upvotes: 0
Reputation: 473903
The problem is that this is a dynamic site: all of the data you see is loaded via additional XHR calls to the website's JSON API. You need to simulate those calls in your code.
Working example using requests:
from urllib.parse import urljoin

import requests

base_image_url = "http://img.pr0gramm.com"

with requests.Session() as session:
    response = session.get("http://pr0gramm.com/api/items/get", params={"flags": 1, "promoted": "1"})
    posts = response.json()["items"]
    for post in posts:
        image_url = urljoin(base_image_url, post["image"])
        # get tags
        response = session.get("http://pr0gramm.com/api/items/info", params={"itemId": post["id"]})
        post_data = response.json()
        tags = [tag["tag"] for tag in post_data["tags"]]
        print(image_url, tags)
This prints each post's image URL along with a list of that post's tags:
http://img.pr0gramm.com/2016/03/07/f693234d558334d7.jpg ['Datsun 1600 Wagon', 'Garage 88', 'Kombi', 'nur Oma liegt tiefer', 'rolladen', 'slow']
http://img.pr0gramm.com/2016/03/07/185544cda956679e.webm ['Danke Merkel', 'deeskalierte zeitnah', 'demokratie im endstadium', 'Fachkraft', 'Far Cry Primal', 'Invite is raus', 'typ ist nackt', 'VVS', 'webm', 'zeigt seine stange']
http://img.pr0gramm.com/2016/03/07/4a6719b33219fd87.jpg ['bmw', 'der Gerät', 'Drehmoment', 'für mehr Motorräder auf pr0', 'Motorrad']
...
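If you want to check the parsing step without hitting the network, the two transformations can be pulled out into small functions. This is only a sketch: the function names are made up, the response shapes are inferred from the keys the code above reads ("image", "id", and a "tags" list of objects with a "tag" key), and the sample dictionaries are hand-written rather than real API responses.

```python
from urllib.parse import urljoin

BASE_IMAGE_URL = "http://img.pr0gramm.com"


def build_image_url(post):
    """Join the relative image path of a post onto the image host."""
    return urljoin(BASE_IMAGE_URL, post["image"])


def tag_names(post_data):
    """Return the tag strings from an items/info style response."""
    # Assumption: a response without a "tags" key is treated as having no tags.
    return [tag["tag"] for tag in post_data.get("tags", [])]


# Hand-written samples in the assumed shape; the id value is hypothetical.
sample_post = {"id": 1, "image": "2016/03/07/f693234d558334d7.jpg"}
sample_info = {"tags": [{"tag": "Kombi"}, {"tag": "slow"}]}

print(build_image_url(sample_post), tag_names(sample_info))
```

In the loop above, these would be called with post and post_data respectively.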
Upvotes: 1