Rachel9866

Reputation: 131

Web scraping PDF links - not returning results

I have set up some code to scrape the PDFs from a local council website. I request the page I want, then get the links for the different dates, and then within each of those pages the links to the PDFs. However, it isn't returning any results.

I've played around with the code and can't figure it out. It runs fine in Jupyter Notebook and doesn't raise any errors.

This is my code:

import requests
from bs4 import BeautifulSoup as bs

dates = ['April 2019', 'July 2019', 'December 2018']
r = requests.get('https://www.gmcameetings.co.uk/meetings/committee/36/economy_business_growth_and_skills_overview_and_scrutiny')
soup = bs(r.content, 'lxml')

f = open(r"E:\Internship\WORK\GMCA\Getting PDFS\gmcabusinessdatelinks.txt", "w+")

for date in dates:
        if ['a'] in soup.select('a:contains("' + date + '")'):
            r2 = requests.get(date['href'])
            print("link1")
            page2 = r2.text
            soup2 = bs(page2, 'lxml')
            pdf_links = soup2.find_all('a', href=True)
            for plink in pdf_links:
                if plink['href'].find('minutes')>1:
                    print("Minutes!")
                    f.write(str(plink['href']) + ' ')
f.close()

It creates a text file, but the file is blank. I want a text file containing all of the links to the PDFs. Thanks.

Upvotes: 0

Views: 54

Answers (2)

chitown88

Reputation: 28565

You can use a regex, soup.find('a', text=re.compile(date)), instead:

import requests
from bs4 import BeautifulSoup as bs
import re

dates = ['April 2019', 'July 2019', 'December 2018']
r = requests.get('https://www.gmcameetings.co.uk/meetings/committee/36/economy_business_growth_and_skills_overview_and_scrutiny')
soup = bs(r.content, 'lxml')

f = open(r"E:\gmcabusinessdatelinks.txt", "w+")

for date in dates:
    # first anchor whose link text matches the date (regex, so a partial match is enough)
    link = soup.find('a', text=re.compile(date))
    r2 = requests.get(link['href'])
    soup2 = bs(r2.text, 'lxml')
    # every anchor on the date page that has an href attribute
    pdf_links = soup2.find_all('a', href=True)
    for plink in pdf_links:
        if plink['href'].find('minutes') > 1:
            print("Minutes!")
            f.write(str(plink['href']) + ' ')
f.close()
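
One caveat: find returns None when no link text matches the regex, so link['href'] would raise a TypeError for a date that isn't on the page. A minimal guard, keeping the same names as above:

for date in dates:
    link = soup.find('a', text=re.compile(date))
    if link is None:
        # no anchor matched this date; skip it instead of crashing
        continue
    r2 = requests.get(link['href'])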

Upvotes: 1

SIM

Reputation: 22440

If you want to get the PDF links containing the minutes keyword, the following should work:

import requests
from bs4 import BeautifulSoup

link = 'https://www.gmcameetings.co.uk/meetings/committee/36/economy_business_growth_and_skills_overview_and_scrutiny'

dates = ['April 2019', 'July 2019', 'December 2018']

r = requests.get(link)
soup = BeautifulSoup(r.text, 'lxml')
# for each date, collect the hrefs of every anchor whose text contains that date
target_links = [[i['href'] for i in soup.select(f'a:contains("{date}")')] for date in dates]

with open("output_file.txt","w",encoding="utf-8") as f:
    for target_link in target_links:
        res = requests.get(target_link[0])
        soup_obj = BeautifulSoup(res.text,"lxml")
        # anchors in the content item list whose href contains 'minutes'
        pdf_links = [item.get("href") for item in soup_obj.select("#content .item-list a[href*='minutes']")]
        for pdf_file in pdf_links:
            print(pdf_file)
            f.write(pdf_file+"\n")
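
Two notes: a:contains("...") is not standard CSS, but BeautifulSoup's soupsieve backend supports it, and the [href*='minutes'] attribute selector replaces the manual .find('minutes') > 1 check from the question. Also, target_link[0] raises an IndexError for a date that matched no links; a small variant that skips empty matches:

for target_link in target_links:
    if not target_link:
        # this date produced no links on the index page; skip it
        continue
    res = requests.get(target_link[0])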

Upvotes: 1
