Python – Extract certain links from website

Question

I want to extract certain links from a website.

To extract all links, I tried:

import urllib.request
import xml.etree.ElementTree as ET
from bs4 import BeautifulSoup

url = 'http://pdok.bundestag.de/index.php?qsafe=&aload=off&q=kleine+anfrage&x=0&y=0&df=22.10.2013&dt=13.01.2016'
uh = urllib.request.urlopen(url)
data = uh.read()
soup = BeautifulSoup(data, 'html.parser')
soup.prettify()

# Print every anchor tag found in the downloaded HTML
for href in soup.find_all('a'):
    print(href)

Now I get a list of links, but for some reason the important links inside tbody are missing. I also tried using ElementTree, but I get an error just reading the page, apparently because it contains some invalid symbols (?). Any help is much appreciated! :)

gtlambert · Accepted Answer

urllib fetches only the raw HTML of the page and does not execute JavaScript. The links that you are trying to scrape in the tbody are rendered by JavaScript, so they never appear in the HTML you download.

You can replicate this behaviour by turning JavaScript off in your browser and visiting the website. If you scrape frequently, you may wish to download a browser plugin which allows you to turn JavaScript on and off quickly.

To scrape websites that load HTML content with JavaScript, you may wish to explore browser automation tools such as Selenium.
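As a rough illustration of that approach, here is a minimal sketch in Python 3. It assumes Selenium 4 with Firefox and its driver installed, and it assumes the links can be matched with the CSS selector "tbody a" (based only on the question's description of the page). The names TbodyLinkParser, tbody_links and rendered_html are made up for this example. The parser part uses only the standard library, so you can also run it on the HTML returned by urllib to confirm that the tbody links are missing there.

```python
from html.parser import HTMLParser


class TbodyLinkParser(HTMLParser):
    """Collect href values of <a> tags that sit inside a <tbody>."""

    def __init__(self):
        super().__init__()
        self.depth = 0   # number of <tbody> elements currently open
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "tbody":
            self.depth += 1
        elif tag == "a" and self.depth > 0:
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "tbody" and self.depth > 0:
            self.depth -= 1


def tbody_links(html):
    """Return the hrefs of all links found inside <tbody> in an HTML string."""
    parser = TbodyLinkParser()
    parser.feed(html)
    return parser.links


def rendered_html(url, wait_seconds=10):
    """Load url in a real browser and return the HTML after JavaScript has run."""
    # Imported here so tbody_links() works even without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Firefox()  # assumption: Firefox and its driver are available
    try:
        driver.get(url)
        # Assumption: the page is "ready" once a link shows up inside a tbody.
        WebDriverWait(driver, wait_seconds).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, "tbody a"))
        )
        return driver.page_source
    finally:
        driver.quit()
```

With the url variable from the question, tbody_links(rendered_html(url)) should then return the links that the plain urllib download was missing, provided the selector assumption holds for that page. If no matching link appears within wait_seconds, the wait raises a timeout exception instead of returning.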
