Scraping PDFs from a webpage

Question

I would like to download all financial reports for a given company from the Danish company register (the CVR register). One example is Chr. Hansen Holding, at the link below:

https://datacvr.virk.dk/data/visenhed?enhedstype=virksomhed&id=28318677&soeg=chr%20hansen&type=undefined&language=da

Specifically, I would like to download all the PDFs under the tab "Regnskaber" (financial reports). I have no previous experience with web scraping in Python. I tried BeautifulSoup, but given my lack of experience, I cannot work out how to find the right elements in the response.

Below is what I tried, but nothing is printed (i.e. it did not find any PDFs).

import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

web_page = "https://datacvr.virk.dk/data/visenhed?enhedstype=virksomhed&id=28318677&soeg=chr%20hansen&type=undefined&language=da"

response = requests.get(web_page)
soup = BeautifulSoup(response.text)
soup.findAll('accordion-toggle')

for link in soup.select("a[href$='.pdf']"):
    print(link['href'].split('/')[-1])

All help and guidance will be much appreciated.

Alon Arad · Accepted Answer

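You should use select with a CSS selector instead of findAll. findAll('accordion-toggle') searches for tags named accordion-toggle, so it will not match elements that merely carry that name as a class.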

import requests
from bs4 import BeautifulSoup

web_page = "https://datacvr.virk.dk/data/visenhed?enhedstype=virksomhed&id=28318677&soeg=chr%20hansen&type=undefined&language=da"

response = requests.get(web_page)
# the 'lxml' parser requires the lxml package to be installed
soup = BeautifulSoup(response.text, 'lxml')
# links marked data-type="PDF" inside the "Regnskaber" accordion section
pdfs = soup.select('div[id="accordion-Regnskaber-og-nogletal"] a[data-type="PDF"]')

for link in pdfs:
    # print the last path segment of each link
    print(link['href'].split('/')[-1])
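
The snippet above only prints the link names, while the question asks to download the files. A minimal sketch of that last step follows, assuming the hrefs are present in the HTML that requests receives and that each one points directly at a PDF. The function names, the `reports` folder and the numbered file names are hypothetical, chosen only for illustration. The selector is the one from the answer, and `html.parser` is used so the sketch does not depend on lxml.

```python
from pathlib import Path
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

# Selector copied from the answer; assumes the page still uses this markup.
PDF_SELECTOR = 'div[id="accordion-Regnskaber-og-nogletal"] a[data-type="PDF"]'


def extract_pdf_urls(html, base_url, selector=PDF_SELECTOR):
    """Return an absolute URL for every link matched by the selector."""
    soup = BeautifulSoup(html, "html.parser")
    # urljoin keeps absolute hrefs as they are and resolves relative ones
    # against the address of the page they were found on.
    return [
        urljoin(base_url, a["href"])
        for a in soup.select(selector)
        if a.get("href")
    ]


def download_pdfs(urls, out_dir="reports"):
    """Save each URL into out_dir and return the paths that were written."""
    target = Path(out_dir)  # hypothetical output folder
    target.mkdir(parents=True, exist_ok=True)
    saved = []
    for i, url in enumerate(urls, start=1):
        response = requests.get(url, timeout=30)
        response.raise_for_status()  # stop on HTTP errors instead of saving an error page
        # Numbered names avoid assuming anything about how the site names files.
        path = target / f"report_{i:03d}.pdf"
        path.write_bytes(response.content)
        saved.append(path)
    return saved
```

With the `response` and `web_page` variables from the answer, `download_pdfs(extract_pdf_urls(response.text, web_page))` would fetch every matched link. If the list comes back empty, the links may not be in the HTML that requests receives.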
