Rachel9866

Reputation: 131

Web scraping PDF links - not returning results

I have set up some code to scrape the PDFs from a local council website. I request the page I want, then get the links for the different dates, and then within each of those pages the links to the PDFs. However, it isn't returning any results.

I've played around with the code and can't figure it out. It runs fine in Jupyter Notebook and doesn't raise any errors.

This is my code:

import requests
from bs4 import BeautifulSoup as bs

dates = ['April 2019', 'July 2019', 'December 2018']
r = requests.get('https://www.gmcameetings.co.uk/meetings/committee/36/economy_business_growth_and_skills_overview_and_scrutiny')
soup = bs(r.content, 'lxml')

f = open(r"E:\Internship\WORK\GMCA\Getting PDFS\gmcabusinessdatelinks.txt", "w+")

for date in dates:
        if ['a'] in soup.select('a:contains("' + date + '")'):
            r2 = requests.get(date['href'])
            print("link1")
            page2 = r2.text
            soup2 = bs(page2, 'lxml')
            pdf_links = soup2.find_all('a', href=True)
            for plink in pdf_links:
                if plink['href'].find('minutes')>1:
                    print("Minutes!")
                    f.write(str(plink['href']) + ' ')
f.close()

It creates a text file, but the file is blank. I want a text file containing all of the links to the PDFs. Thanks.

Upvotes: 0

Views: 54

Answers (2)

chitown88

Reputation: 28565

You can use a regex, soup.find('a', text=re.compile(date)), instead:

import requests
from bs4 import BeautifulSoup as bs
import re

dates = ['April 2019', 'July 2019', 'December 2018']
r = requests.get('https://www.gmcameetings.co.uk/meetings/committee/36/economy_business_growth_and_skills_overview_and_scrutiny')
soup = bs(r.content, 'lxml')

f = open(r"E:\gmcabusinessdatelinks.txt", "w+")

for date in dates:
    # first anchor whose link text matches the date (regex, so a partial match is enough)
    link = soup.find('a', text=re.compile(date))
    r2 = requests.get(link['href'])
    soup2 = bs(r2.text, 'lxml')
    # every anchor on the date page that has an href attribute
    pdf_links = soup2.find_all('a', href=True)
    for plink in pdf_links:
        if plink['href'].find('minutes') > 1:
            print("Minutes!")
            f.write(str(plink['href']) + ' ')
f.close()
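
One caveat: find returns None when no link text matches the regex, so link['href'] would raise a TypeError for a date that isn't on the page. A minimal guard, keeping the same names as above:

for date in dates:
    link = soup.find('a', text=re.compile(date))
    if link is None:
        # no anchor matched this date; skip it instead of crashing
        continue
    r2 = requests.get(link['href'])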

Upvotes: 1

SIM

Reputation: 22440

If you want to get the PDF links containing the minutes keyword, the following should work:

import requests
from bs4 import BeautifulSoup

link = 'https://www.gmcameetings.co.uk/meetings/committee/36/economy_business_growth_and_skills_overview_and_scrutiny'

dates = ['April 2019', 'July 2019', 'December 2018']

r = requests.get(link)
soup = BeautifulSoup(r.text, 'lxml')
# for each date, collect the hrefs of every anchor whose text contains that date
target_links = [[i['href'] for i in soup.select(f'a:contains("{date}")')] for date in dates]

with open("output_file.txt","w",encoding="utf-8") as f:
    for target_link in target_links:
        res = requests.get(target_link[0])
        soup_obj = BeautifulSoup(res.text,"lxml")
        # anchors in the content item list whose href contains 'minutes'
        pdf_links = [item.get("href") for item in soup_obj.select("#content .item-list a[href*='minutes']")]
        for pdf_file in pdf_links:
            print(pdf_file)
            f.write(pdf_file+"\n")
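
Two notes: a:contains("...") is not standard CSS, but BeautifulSoup's soupsieve backend supports it, and the [href*='minutes'] attribute selector replaces the manual .find('minutes') > 1 check from the question. Also, target_link[0] raises an IndexError for a date that matched no links; a small variant that skips empty matches:

for target_link in target_links:
    if not target_link:
        # this date produced no links on the index page; skip it
        continue
    res = requests.get(target_link[0])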

Upvotes: 1
