Yashwardhan kaul

Reputation: 21

Going deeper into a website while web scraping

I am trying to scrape a number of websites for text so that I can cross-check it against a corpus and display how many hits particular words get on those websites. Can someone please help me make my web scraper go deeper into a website automatically?

import requests
from bs4 import BeautifulSoup

url = 'https://www.theleela.com/en_us/hotels-in-mumbai/the-leela-mumbai-hotel/spa/'
page = requests.get(url)        # fetch the page from the website
html = page.content

# name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(html, "html.parser")

# kill all script and style elements
for script in soup(["script", "style"]):
    script.extract()    # rip it out

# get text
text = soup.get_text()

# break into lines and remove leading and trailing space on each
lines = (line.strip() for line in text.splitlines())
# break multi-headlines into a line each
chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
# drop blank lines
text = '\n'.join(chunk for chunk in chunks if chunk)

I collect all the links on the page like this:

links = []
for link in soup.find_all('a'):
    a = link.get('href')
    # keep only string hrefs that are not absolute https URLs
    if isinstance(a, str) and "https:" not in a:
        links.append(a)
links

This is what I get:

['/en_us/offers/index',
 '/en_us/hotels-in-bengaluru/the-leela-palace-hotel-bengaluru',
 '/en_us/hotels-in-bengaluru/the-leela-palace-hotel-bengaluru/overview',
 '/en_us/hotels-in-bengaluru/the-leela-palace-hotel-bengaluru/rooms-and-suites',
 '/en_us/hotels-in-bengaluru/the-leela-palace-hotel-bengaluru/offers',
 '/en_us/hotels-in-bengaluru/the-leela-palace-hotel-bengaluru/meetings',
 '/en_us/hotels-in-bengaluru/the-leela-palace-hotel-bengaluru/celebrations',
 '/en_us/hotels-in-bengaluru/the-leela-palace-hotel-bengaluru/dining',
 '/en_us/hotels-in-bengaluru/the-leela-palace-hotel-bengaluru/Spa',
 '/en_us/hotels-in-chennai/the-leela-palace-hotel-chennai',
 '/en_us/hotels-in-chennai/the-leela-palace-hotel-chennai/overview',
 '/en_us/hotels-in-chennai/the-leela-palace-hotel-chennai/rooms-and-suites',
 '/en_us/hotels-in-chennai/the-leela-palace-hotel-chennai/offers',
 '/en_us/hotels-in-chennai/the-leela-palace-hotel-chennai/meetings',
 '/en_us/hotels-in-chennai/the-leela-palace-hotel-chennai/celebrations',
 '/en_us/hotels-in-chennai/the-leela-palace-hotel-chennai/dining',
 '/en_us/hotels-in-chennai/the-leela-palace-hotel-chennai/spa',
 '/en_us/hotels-in-delhi/the-leela-palace-hotel-new-delhi',
 '/en_us/hotels-in-delhi/the-leela-palace-hotel-new-delhi/overview',
 '/en_us/hotels-in-delhi/the-leela-palace-hotel-new-delhi/rooms-and-suites',
 '/en_us/hotels-in-delhi/the-leela-palace-hotel-new-delhi/offers',
 '/en_us/hotels-in-delhi/the-leela-palace-hotel-new-delhi/meetings',
 '/en_us/hotels-in-delhi/the-leela-palace-hotel-new-delhi/celebrations',
 '/en_us/hotels-in-delhi/the-leela-palace-hotel-new-delhi/dining',
 '/en_us/hotels-in-delhi/the-leela-palace-hotel-new-delhi/spa',
 '/en_us/hotels-in-delhi/the-leela-ambience-convention-hotel-delhi',
 '/en_us/hotels-in-delhi/the-leela-ambience-convention-hotel-delhi/overview',
 '/en_us/hotels-in-delhi/the-leela-ambience-convention-hotel-delhi/rooms-and-suites',
 '/en_us/hotels-in-delhi/the-leela-ambience-convention-hotel-delhi/offers',
 '/en_us/hotels-in-delhi/the-leela-ambience-convention-hotel-delhi/meetings',
 '/en_us/hotels-in-delhi/the-leela-ambience-convention-hotel-delhi/celebrations',
 '/en_us/hotels-in-delhi/the-leela-ambience-convention-hotel-delhi/dining',
 '/en_us/hotels-in-delhi/the-leela-ambience-convention-hotel-delhi/spa',
 '/en_us/hotels-in-goa/the-leela-goa-hotel',
 '/en_us/hotels-in-goa/the-leela-goa-hotel/overview',
 '/en_us/hotels-in-goa/the-leela-goa-hotel/rooms-and-suites',
 '/en_us/hotels-in-goa/the-leela-goa-hotel/offers',
 '/en_us/hotels-in-goa/the-leela-goa-hotel/meetings',
 '/en_us/hotels-in-goa/the-leela-goa-hotel/celebrations',
 '/en_us/hotels-in-goa/the-leela-goa-hotel/dining',
 '/en_us/hotels-in-goa/the-leela-goa-hotel/spa',
 '/en_us/hotels-in-gurugram/the-leela-ambience-hotel-gurugram',
 '/en_us/hotels-in-gurugram/the-leela-ambience-hotel-gurugram/overview',
 '/en_us/hotels-in-gurugram/the-leela-ambience-hotel-gurugram/rooms-and-suites',
 '/en_us/hotels-in-gurugram/the-leela-ambience-hotel-gurugram/offers',
 '/en_us/hotels-in-gurugram/the-leela-ambience-hotel-gurugram/meetings',
 '/en_us/hotels-in-gurugram/the-leela-ambience-hotel-gurugram/celebrations',
 '/en_us/hotels-in-gurugram/the-leela-ambience-hotel-gurugram/dining',
 '/en_us/hotels-in-gurugram/the-leela-ambience-hotel-gurugram/spa',
 '/en_us/hotels-in-kovalam/the-leela-kovalam-hotel',
 '/en_us/hotels-in-kovalam/the-leela-kovalam-hotel/overview',
 '/en_us/hotels-in-kovalam/the-leela-kovalam-hotel/rooms-and-suites',
 '/en_us/hotels-in-kovalam/the-leela-kovalam-hotel/offers',
 '/en_us/hotels-in-kovalam/the-leela-kovalam-hotel/meetings',
 '/en_us/hotels-in-kovalam/the-leela-kovalam-hotel/celebrations',
 '/en_us/hotels-in-kovalam/the-leela-kovalam-hotel/dining',
 '/en_us/hotels-in-kovalam/the-leela-kovalam-hotel/spa',
 '/en_us/hotels-in-mumbai/the-leela-mumbai-hotel',
 '/en_us/hotels-in-mumbai/the-leela-mumbai-hotel/overview',
 '/en_us/hotels-in-mumbai/the-leela-mumbai-hotel/rooms-and-suites',
 '/en_us/hotels-in-mumbai/the-leela-mumbai-hotel/offers',
 '/en_us/hotels-in-mumbai/the-leela-mumbai-hotel/meetings',
 '/en_us/hotels-in-mumbai/the-leela-mumbai-hotel/celebrations',
 '/en_us/hotels-in-mumbai/the-leela-mumbai-hotel/dining',
 '/en_us/hotels-in-mumbai/the-leela-mumbai-hotel/spa',
 '/en_us/hotels-in-udaipur/the-leela-palace-hotel-udaipur',
 '/en_us/hotels-in-udaipur/the-leela-palace-hotel-udaipur/overview',
 '/en_us/hotels-in-udaipur/the-leela-palace-hotel-udaipur/rooms-and-suites',
 '/en_us/hotels-in-udaipur/the-leela-palace-hotel-udaipur/offers',
 '/en_us/hotels-in-udaipur/the-leela-palace-hotel-udaipur/meetings',
 '/en_us/hotels-in-udaipur/the-leela-palace-hotel-udaipur/celebrations',
 '/en_us/hotels-in-udaipur/the-leela-palace-hotel-udaipur/dining',
 '/en_us/hotels-in-udaipur/the-leela-palace-hotel-udaipur/spa',
 'javascript:facebookLogin();',
 'javascript:forgot_password(this);',
 '/application/spring/myprofile/my-profile-edit',
 '/en_us',
 '/application/spring/myprofile/login',
 '/the-leela/best-rates-guaranteed',
 '#',
 'javascript:facebookLogin();',
 '/application/spring/myprofile/my-profile-edit',
 '/en_us',
 '/en_us/hotels-in-mumbai/the-leela-mumbai-hotel',
 '/en_us/hotels-in-mumbai/the-leela-mumbai-hotel/spa',
 '/en_us/hotels-in-mumbai/the-leela-mumbai-hotel/spa/signature-spa-treatments-',
 '/en_us/hotels-in-mumbai/the-leela-mumbai-hotel/spa/holistic-treatments-',
 '/en_us/hotels-in-mumbai/the-leela-mumbai-hotel/spa/fitness',
 '/en_us/hotels-in-mumbai/the-leela-mumbai-hotel/spa/wellness',
 '/en_us/hotels-in-mumbai/the-leela-mumbai-hotel/spa/salon',
 '/en_us/hotels-in-mumbai/the-leela-mumbai-hotel/spa/signature-spa-treatments-',
 '/en_us/hotels-in-mumbai/the-leela-mumbai-hotel/spa/holistic-treatments-',
 '/en_us/hotels-in-mumbai/the-leela-mumbai-hotel/spa/fitness',
 '/en_us/hotels-in-mumbai/the-leela-mumbai-hotel/spa/wellness',
 '/en_us/hotels-in-mumbai/the-leela-mumbai-hotel/spa/salon',
 '/contentAsset/raw-data/d1e3f704-be84-4353-a95e-28629651db00/fileAsset',
 '/the-leela/about-the-leela/history',
 '/the-leela/about-the-leela/company-information',
 '/the-leela/about-the-leela/alliances',
 '/the-leela/about-the-leela/investor-relations',
 '/the-leela/about-the-leela/future-openings',
 'javascript:void(0);',
 '/the-leela/media/media-coverage',
 '/the-leela/media/press-releases',
 '/the-leela/media/media-contacts',
 '/the-leela/media/the-leela-magazine',
 '/the-leela/media/awards',
 '/the-leela/Loyalty/the-leela-discovery',
 '/the-leela/Loyalty/leela-solitaire-line',
 '/the-leela/Loyalty/connoisseur-club',
 '/the-leela/Loyalty/the-leela-preferred-partners-membership-program',
 '/the-leela/careers/opportunities',
 '/the-leela/contact-us/hotels',
 '/the-leela/contact-us/convention-centre',
 '/the-leela/contact-us/reservations',
 '/the-leela/contact-us/sales-marketing-offices',
 'javascript:void(0);',
 '/the-leela/others/art',
 '/the-leela/others/boutique',
 '/the-leela/termsConditions/legal',
 '/the-leela/termsConditions/siteMap',
 '/the-leela/termsConditions/privacy-policy',
 '/the-leela/termsConditions/general-terms-and-conditions']

As you can see, there are still some irrelevant links here:

 'javascript:void(0);',
 '/application/spring/myprofile/login',
 '/the-leela/best-rates-guaranteed',
 '#',
 'javascript:facebookLogin();',
 '/application/spring/myprofile/my-profile-edit',
 '/en_us',

I need help getting rid of these so that I can run the scraper in a loop over the output list. I'd appreciate any help.

Upvotes: 2

Views: 206

Answers (1)

Vladimir

Reputation: 10503

I doubt there's a ready-made solution that isn't site-specific. Based on my experience with crawlers, a few things come to mind:

  • You may use the site's sitemap, which is usually there for crawlers like yours and should contain links to all the important pages the site owners want crawled. robots.txt could also be useful.
  • You may try to download all the pages and use the mimetypes library and/or the Content-Type response header to decide which responses are actually worth parsing.
  • You may want to add some heuristic keywords or rules, such as regular expressions, to keep your crawler from reaching or crawling certain URLs.
  • Finally (if this is a huge multi-month project covering many hundreds or thousands of websites), you may try to limit the URLs further using machine learning.
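
To make the heuristic-rules idea a bit more concrete, here is a minimal sketch using only the standard library. The function name filter_links and the SKIP_PATTERNS list are my own illustrative choices, not anything from the question, and the patterns are guesses at what counts as irrelevant on this site, so adjust them to taste. It resolves relative hrefs against the page URL, drops javascript: and "#" pseudo-links, stays on the same host, and removes duplicates.

```python
import re
from urllib.parse import urljoin, urldefrag, urlparse

# Hypothetical skip rules: path patterns the crawler should not follow.
SKIP_PATTERNS = [
    re.compile(r"/application/"),   # login and profile pages
    re.compile(r"/contentAsset/"),  # raw file assets
]


def filter_links(base_url, hrefs, skip_patterns=SKIP_PATTERNS):
    """Return absolute, de-duplicated, same-site URLs that pass the skip rules."""
    base_host = urlparse(base_url).netloc
    seen = set()
    result = []
    for href in hrefs:
        if not isinstance(href, str):
            continue
        href = href.strip()
        # Empty hrefs and pure in-page anchors lead nowhere new.
        if not href or href.startswith("#"):
            continue
        # Resolve relative paths and drop any '#fragment' suffix.
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        parts = urlparse(absolute)
        # Drops javascript:, mailto:, tel: and similar pseudo-links.
        if parts.scheme not in ("http", "https"):
            continue
        # Stay on the site being crawled.
        if parts.netloc != base_host:
            continue
        if any(pattern.search(parts.path) for pattern in skip_patterns):
            continue
        if absolute in seen:
            continue
        seen.add(absolute)
        result.append(absolute)
    return result


base = "https://www.theleela.com/en_us/hotels-in-mumbai/the-leela-mumbai-hotel/spa/"
sample = [
    "/en_us/offers/index",
    "javascript:void(0);",
    "#",
    "/application/spring/myprofile/login",
    "/en_us/offers/index",
    "/the-leela/media/awards",
]
print(filter_links(base, sample))
```

The returned URLs are absolute, so each one can be passed straight to requests.get in a loop. Keeping a single seen set across the whole crawl, rather than per page as here, would also stop the scraper from revisiting pages it has already fetched.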

Upvotes: 1
