Reputation: 131
I have set up some code to scrape the PDFs off a local council website. I've requested the page I want, then got the links to the different dates, then within each of them the links to the PDFs. However, it's not returning any results.
I've played around with the code and can't figure it out. It runs fine in Jupyter Notebook and doesn't raise any errors.
This is my code:
import requests
from bs4 import BeautifulSoup as bs
dates = ['April 2019', 'July 2019', 'December 2018']
r = requests.get('https://www.gmcameetings.co.uk/meetings/committee/36/economy_business_growth_and_skills_overview_and_scrutiny')
soup = bs(r.content, 'lxml')
f = open(r"E:\Internship\WORK\GMCA\Getting PDFS\gmcabusinessdatelinks.txt", "w+")
for date in dates:
    if ['a'] in soup.select('a:contains("' + date + '")'):
        r2 = requests.get(date['href'])
        print("link1")
        page2 = r2.text
        soup2 = bs(page2, 'lxml')
        pdf_links = soup2.find_all('a', href=True)
        for plink in pdf_links:
            if plink['href'].find('minutes')>1:
                print("Minutes!")
                f.write(str(plink['href']) + ' ')
f.close()
It creates the text file, but it's blank. I want a text file containing all of the links to the PDFs. Thanks.
Upvotes: 0
Views: 54
Reputation: 28565
You can use a regex, soup.find('a', text=re.compile(date)), instead:
import requests
from bs4 import BeautifulSoup as bs
import re
dates = ['April 2019', 'July 2019', 'December 2018']
r = requests.get('https://www.gmcameetings.co.uk/meetings/committee/36/economy_business_growth_and_skills_overview_and_scrutiny')
soup = bs(r.content, 'lxml')
f = open(r"E:\gmcabusinessdatelinks.txt", "w+")
for date in dates:
    link = soup.find('a', text=re.compile(date))
    r2 = requests.get(link['href'])
    print("link1")
    page2 = r2.text
    soup2 = bs(page2, 'lxml')
    pdf_links = soup2.find_all('a', href=True)
    for plink in pdf_links:
        if plink['href'].find('minutes')>1:
            print("Minutes!")
            f.write(str(plink['href']) + ' ')
f.close()
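One caveat worth noting (my addition, not part of the original answer): soup.find returns None when no anchor text matches a date, so indexing ['href'] on the result would raise a TypeError. A minimal sketch of a guard, using made-up HTML rather than the real council page:

```python
import re
from bs4 import BeautifulSoup as bs

# Toy HTML standing in for the council page (hypothetical URL and text)
html = '<a href="https://example.com/april">Meeting April 2019</a>'
soup = bs(html, 'html.parser')

# Match: re.search finds "April 2019" inside the anchor text
link = soup.find('a', text=re.compile('April 2019'))
if link is not None:
    print(link['href'])  # https://example.com/april

# No match: find returns None, so guard before touching ['href']
missing = soup.find('a', text=re.compile('December 2018'))
print(missing)  # None
```

In the loop above, a simple `if link is None: continue` would skip dates that don't appear on the page instead of crashing.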
Upvotes: 1
Reputation: 22440
If you want to get the PDF links containing the minutes keyword, then the following should work:
import requests
from bs4 import BeautifulSoup
link = 'https://www.gmcameetings.co.uk/meetings/committee/36/economy_business_growth_and_skills_overview_and_scrutiny'
dates = ['April 2019', 'July 2019', 'December 2018']
r = requests.get(link)
soup = BeautifulSoup(r.text, 'lxml')
target_links = [[i['href'] for i in soup.select(f'a:contains("{date}")')] for date in dates]
with open("output_file.txt","w",encoding="utf-8") as f:
    for target_link in target_links:
        res = requests.get(target_link[0])
        soup_obj = BeautifulSoup(res.text,"lxml")
        pdf_links = [item.get("href") for item in soup_obj.select("#content .item-list a[href*='minutes']")]
        for pdf_file in pdf_links:
            print(pdf_file)
            f.write(pdf_file+"\n")
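For anyone unfamiliar with the a[href*='minutes'] part: it is a CSS attribute substring selector, matching anchors whose href contains the substring "minutes". A small self-contained sketch on toy markup (the structure below is illustrative, not the real page):

```python
from bs4 import BeautifulSoup

# Hypothetical markup mimicking the page's "#content .item-list" structure
html = """
<div id="content">
  <div class="item-list">
    <a href="/docs/minutes_24_april_2019.pdf">Minutes</a>
    <a href="/docs/agenda_24_april_2019.pdf">Agenda</a>
  </div>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')

# href*='minutes' keeps only the links whose href contains "minutes"
pdf_links = [a.get('href') for a in soup.select("#content .item-list a[href*='minutes']")]
print(pdf_links)  # ['/docs/minutes_24_april_2019.pdf']
```

Scoping the selector to #content .item-list avoids picking up "minutes" links from navigation menus elsewhere on the page.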
Upvotes: 1