Reputation: 131
I am web-scraping a lot of PDFs of committee meetings off a local government website (https://www.gmcameetings.co.uk/), so there are links within links within links. I can successfully scrape all the 'a' tags from the main area of the page (the ones that I want), but when I try to scrape anything within them I get the error in the title of the question: AttributeError: ResultSet object has no attribute 'find'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()? How do I fix this?
I am completely new to coding and started an internship yesterday for which I am expected to web-scrape this information. The woman I'm supposed to be working with is not here for another couple of days and nobody else can help me, so please bear with me and be kind, as I am a complete beginner doing this alone. I know I have set up the first part of the code correctly, as I can download the whole page or download any particular link. Again, it's when I try to scrape within the links I have already (and successfully) scraped that I get the error message above. I think (with the little knowledge I have) that it's because of the output of all_links, part of which is shown below. I have tried both find() and find_all(), which both result in the same error message.
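For reference, the first part of my code looks roughly like this (simplified; this is a reconstruction rather than my exact code):

import requests
from bs4 import BeautifulSoup

page = requests.get('https://www.gmcameetings.co.uk/')
soup = BeautifulSoup(page.text, 'html.parser')
# find_all returns a ResultSet (a list of Tag objects), not a single Tag
all_links = soup.find_all('a')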
#the error message
date_links_area = all_links.find('ul', {"class": "item-list item-list--rich"})
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\rache\AppData\Local\Programs\Python\Python37-32\lib\site-packages\bs4\element.py", line 1620, in __getattr__
    "ResultSet object has no attribute '%s'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?" % key
AttributeError: ResultSet object has no attribute 'find'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?
#output of all_links looks like this (this is only part of it)
href="https://www.gmcameetings.co.uk/info/20180/live_meetings/199/membership_201819">Members of the GMCA 2018/19, Greater Manchester Combined Authority Constitution, Meeting papers,
Some of those links then go to a page that has a list of dates, which is the area of the page I'm trying to get to. Within that area I need to get the links with the dates, and within those I need to grab the PDFs I want. Apologies if this doesn't make sense; I'm trying my best to do this on my own with zero experience.
Upvotes: 3
Views: 188
Reputation: 71471
This solution uses recursion to follow the links on each page until the PDF URLs are discovered, keeping track of pages already visited so it doesn't loop:
from bs4 import BeautifulSoup as soup
import requests

def scrape(url, seen=None):
    # track pages already visited so mutually-linked pages don't recurse forever
    if seen is None:
        seen = set()
    if url in seen:
        return
    seen.add(url)
    try:
        # search only the main content area, and only <a> tags that have an href
        for i in soup(requests.get(url).text, 'html.parser').find('main', {'id': 'content'}).find_all('a', href=True):
            if '/downloads/meeting/' in i['href'] or '/downloads/file/' in i['href']:
                yield i  # a PDF download link
            elif i['href'].startswith('https://www.gmcameetings.co.uk'):
                yield from scrape(i['href'], seen)  # an internal page: recurse into it
    except (AttributeError, requests.RequestException):
        pass  # no <main id="content"> on this page, or the request failed

urls = list(scrape('https://www.gmcameetings.co.uk/'))
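Once the tags are collected, saving the files is a separate step. A minimal sketch (the filename logic below is an assumption about the URL shape, so adjust it as needed):

for tag in urls:
    href = tag['href']
    # assumption: use the last path segment as a filename
    name = href.rstrip('/').split('/')[-1] + '.pdf'
    with open(name, 'wb') as f:
        f.write(requests.get(href).content)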
Upvotes: 2
Reputation: 1734
The error is actually telling you what the problem is. all_links is a list (a ResultSet object) of the HTML elements you found. You need to iterate over the list and call find() on each element:
sub_links = [link.find('ul', {"class": "item-list item-list--rich"}) for link in all_links]
Upvotes: 0