Rachel9866

Reputation: 131

Scraping links within links already scraped with python

I am web-scraping a lot of PDFs of committee meeting papers from a local government website (https://www.gmcameetings.co.uk/), so there are links within links within links. I can successfully scrape all the 'a' tags from the main area of the page (the ones that I want), but when I try to scrape anything within them I get this error: AttributeError: ResultSet object has no attribute 'find'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()? How do I fix this?

I am completely new to coding and started an internship yesterday, for which I am expected to web-scrape this information. The woman I'm supposed to be working with is not here for another couple of days and nobody else can help me, so please bear with me and be kind, as I am a complete beginner doing this alone. I know I have set up the first part of the code correctly, as I can download the whole page or download any particular link. Again, it's when I try to scrape within the links I have already (and successfully) scraped that I get the above error message. I think (with the little knowledge I have) that it's because of the 'output' of all_links, which comes out as shown below. I have tried both find() and findAll(), which both result in the same error message.

 #the error message
 date_links_area = all_links.find('ul', {"class": "item-list item-list--rich"})
 Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
   File "C:\Users\rache\AppData\Local\Programs\Python\Python37-32\lib\site-packages\bs4\element.py", line 1620, in __getattr__
     "ResultSet object has no attribute '%s'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?" % key
 AttributeError: ResultSet object has no attribute 'find'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?

#output of all_links looks like this (this is only part of it)

href="https://www.gmcameetings.co.uk/info/20180/live_meetings/199/membership_201819">Members of the GMCA 2018/19, Greater Manchester Combined Authority Constitution, Meeting papers,

Some of those links go to a page that has a list of dates, which is the area of the page I'm trying to get to. Within that area I need to get the links with the dates, and within those I need to grab the PDFs I want. Apologies if this doesn't make sense. I'm trying my best to do this on my own with zero experience.

Upvotes: 3

Views: 188

Answers (2)

Ajax1234

Reputation: 71471

This solution uses recursion to continuously scrape the links on each page until the PDF urls are discovered:

from bs4 import BeautifulSoup as soup
import requests

def scrape(url, seen=None):
    # Track visited pages so circular links do not cause infinite recursion.
    if seen is None:
        seen = set()
    if url in seen:
        return
    seen.add(url)
    try:
        page = soup(requests.get(url).text, 'html.parser')
        for i in page.find('main', {'id': 'content'}).find_all('a'):
            href = i.get('href', '')
            if '/downloads/meeting/' in href or '/downloads/file/' in href:
                yield i  # found a document link
            elif href.startswith('https://www.gmcameetings.co.uk'):
                yield from scrape(href, seen)  # follow internal links deeper
    except (requests.RequestException, AttributeError):
        pass  # skip pages that fail to load or lack the main content area

urls = list(scrape('https://www.gmcameetings.co.uk/'))
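Once the document links are collected, the PDFs themselves still need to be downloaded. A minimal sketch of that step (the `filename_from_url` helper and the `pdfs` destination folder are my own assumptions, not part of the answer above):

```python
import os
import requests

def filename_from_url(url):
    # Derive a file name from the last path segment of the URL,
    # appending '.pdf' when the segment has no extension.
    name = url.rstrip('/').split('/')[-1]
    return name if name.endswith('.pdf') else name + '.pdf'

def download_pdfs(links, dest='pdfs'):
    # links: the anchor tags yielded by scrape(); each has an 'href' attribute.
    os.makedirs(dest, exist_ok=True)
    for a in links:
        url = a['href']
        r = requests.get(url)
        with open(os.path.join(dest, filename_from_url(url)), 'wb') as f:
            f.write(r.content)
```

Deriving names from the URL keeps the files distinguishable, but be aware that two different pages could share a final path segment, in which case one file would overwrite the other.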

Upvotes: 2

facelessuser

Reputation: 1734

The error is actually telling you what the problem is. all_links is a list (a ResultSet object) of HTML elements you found. You need to iterate over the list and call find on each element:

sub_links = [link.find('ul', {"class": "item-list item-list--rich"}) for link in all_links]
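The same per-element pattern then extracts the date links from each matched list. A small self-contained sketch, assuming markup like the question describes (the sample HTML and hrefs here are illustrative, not taken from the real site):

```python
from bs4 import BeautifulSoup

# Illustrative stand-in for one of the pages the question drills into.
html = """
<main>
  <ul class="item-list item-list--rich">
    <li><a href="/date/2019-06-28">28 June 2019</a></li>
    <li><a href="/date/2019-05-31">31 May 2019</a></li>
  </ul>
</main>
"""
page = BeautifulSoup(html, 'html.parser')

# Iterate over each matched <ul>, then over each <a> inside it,
# collecting the href of every date link.
date_links = [a['href']
              for ul in page.find_all('ul', {'class': 'item-list item-list--rich'})
              for a in ul.find_all('a')]
# date_links is now ['/date/2019-06-28', '/date/2019-05-31']
```

Note that find() returns None when an element has no matching `<ul>`, so a comprehension over individual elements should be prepared to filter out None results before calling further methods on them.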

Upvotes: 0
