Reputation: 21
I am trying to scrape the text of a number of websites so that I can cross-validate it against a corpus and display how many hits particular words get on those sites. Can someone please help me make my web scraper go deeper into a website automatically?
import requests
from bs4 import BeautifulSoup

url = 'https://www.theleela.com/en_us/hotels-in-mumbai/the-leela-mumbai-hotel/spa/'
page = requests.get(url)  # fetch the page from the website
html = page.content
# name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(html, 'html.parser')
# kill all script and style elements
for script in soup(["script", "style"]):
    script.extract()  # rip it out
# get text
text = soup.get_text()
# break into lines and remove leading and trailing space on each
lines = (line.strip() for line in text.splitlines())
# break multi-headlines (separated by a double space) into a line each
chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
# drop blank lines
text = '\n'.join(chunk for chunk in chunks if chunk)
I collect all the links on the page like this:
links = []
for link in soup.find_all('a'):
    a = link.get('href')
    if isinstance(a, str) and "https:" not in a:
        links.append(a)
links
This is what I get:
['/en_us/offers/index',
'/en_us/hotels-in-bengaluru/the-leela-palace-hotel-bengaluru',
'/en_us/hotels-in-bengaluru/the-leela-palace-hotel-bengaluru/overview',
'/en_us/hotels-in-bengaluru/the-leela-palace-hotel-bengaluru/rooms-and-suites',
'/en_us/hotels-in-bengaluru/the-leela-palace-hotel-bengaluru/offers',
'/en_us/hotels-in-bengaluru/the-leela-palace-hotel-bengaluru/meetings',
'/en_us/hotels-in-bengaluru/the-leela-palace-hotel-bengaluru/celebrations',
'/en_us/hotels-in-bengaluru/the-leela-palace-hotel-bengaluru/dining',
'/en_us/hotels-in-bengaluru/the-leela-palace-hotel-bengaluru/Spa',
'/en_us/hotels-in-chennai/the-leela-palace-hotel-chennai',
'/en_us/hotels-in-chennai/the-leela-palace-hotel-chennai/overview',
'/en_us/hotels-in-chennai/the-leela-palace-hotel-chennai/rooms-and-suites',
'/en_us/hotels-in-chennai/the-leela-palace-hotel-chennai/offers',
'/en_us/hotels-in-chennai/the-leela-palace-hotel-chennai/meetings',
'/en_us/hotels-in-chennai/the-leela-palace-hotel-chennai/celebrations',
'/en_us/hotels-in-chennai/the-leela-palace-hotel-chennai/dining',
'/en_us/hotels-in-chennai/the-leela-palace-hotel-chennai/spa',
'/en_us/hotels-in-delhi/the-leela-palace-hotel-new-delhi',
'/en_us/hotels-in-delhi/the-leela-palace-hotel-new-delhi/overview',
'/en_us/hotels-in-delhi/the-leela-palace-hotel-new-delhi/rooms-and-suites',
'/en_us/hotels-in-delhi/the-leela-palace-hotel-new-delhi/offers',
'/en_us/hotels-in-delhi/the-leela-palace-hotel-new-delhi/meetings',
'/en_us/hotels-in-delhi/the-leela-palace-hotel-new-delhi/celebrations',
'/en_us/hotels-in-delhi/the-leela-palace-hotel-new-delhi/dining',
'/en_us/hotels-in-delhi/the-leela-palace-hotel-new-delhi/spa',
'/en_us/hotels-in-delhi/the-leela-ambience-convention-hotel-delhi',
'/en_us/hotels-in-delhi/the-leela-ambience-convention-hotel-delhi/overview',
'/en_us/hotels-in-delhi/the-leela-ambience-convention-hotel-delhi/rooms-and-suites',
'/en_us/hotels-in-delhi/the-leela-ambience-convention-hotel-delhi/offers',
'/en_us/hotels-in-delhi/the-leela-ambience-convention-hotel-delhi/meetings',
'/en_us/hotels-in-delhi/the-leela-ambience-convention-hotel-delhi/celebrations',
'/en_us/hotels-in-delhi/the-leela-ambience-convention-hotel-delhi/dining',
'/en_us/hotels-in-delhi/the-leela-ambience-convention-hotel-delhi/spa',
'/en_us/hotels-in-goa/the-leela-goa-hotel',
'/en_us/hotels-in-goa/the-leela-goa-hotel/overview',
'/en_us/hotels-in-goa/the-leela-goa-hotel/rooms-and-suites',
'/en_us/hotels-in-goa/the-leela-goa-hotel/offers',
'/en_us/hotels-in-goa/the-leela-goa-hotel/meetings',
'/en_us/hotels-in-goa/the-leela-goa-hotel/celebrations',
'/en_us/hotels-in-goa/the-leela-goa-hotel/dining',
'/en_us/hotels-in-goa/the-leela-goa-hotel/spa',
'/en_us/hotels-in-gurugram/the-leela-ambience-hotel-gurugram',
'/en_us/hotels-in-gurugram/the-leela-ambience-hotel-gurugram/overview',
'/en_us/hotels-in-gurugram/the-leela-ambience-hotel-gurugram/rooms-and-suites',
'/en_us/hotels-in-gurugram/the-leela-ambience-hotel-gurugram/offers',
'/en_us/hotels-in-gurugram/the-leela-ambience-hotel-gurugram/meetings',
'/en_us/hotels-in-gurugram/the-leela-ambience-hotel-gurugram/celebrations',
'/en_us/hotels-in-gurugram/the-leela-ambience-hotel-gurugram/dining',
'/en_us/hotels-in-gurugram/the-leela-ambience-hotel-gurugram/spa',
'/en_us/hotels-in-kovalam/the-leela-kovalam-hotel',
'/en_us/hotels-in-kovalam/the-leela-kovalam-hotel/overview',
'/en_us/hotels-in-kovalam/the-leela-kovalam-hotel/rooms-and-suites',
'/en_us/hotels-in-kovalam/the-leela-kovalam-hotel/offers',
'/en_us/hotels-in-kovalam/the-leela-kovalam-hotel/meetings',
'/en_us/hotels-in-kovalam/the-leela-kovalam-hotel/celebrations',
'/en_us/hotels-in-kovalam/the-leela-kovalam-hotel/dining',
'/en_us/hotels-in-kovalam/the-leela-kovalam-hotel/spa',
'/en_us/hotels-in-mumbai/the-leela-mumbai-hotel',
'/en_us/hotels-in-mumbai/the-leela-mumbai-hotel/overview',
'/en_us/hotels-in-mumbai/the-leela-mumbai-hotel/rooms-and-suites',
'/en_us/hotels-in-mumbai/the-leela-mumbai-hotel/offers',
'/en_us/hotels-in-mumbai/the-leela-mumbai-hotel/meetings',
'/en_us/hotels-in-mumbai/the-leela-mumbai-hotel/celebrations',
'/en_us/hotels-in-mumbai/the-leela-mumbai-hotel/dining',
'/en_us/hotels-in-mumbai/the-leela-mumbai-hotel/spa',
'/en_us/hotels-in-udaipur/the-leela-palace-hotel-udaipur',
'/en_us/hotels-in-udaipur/the-leela-palace-hotel-udaipur/overview',
'/en_us/hotels-in-udaipur/the-leela-palace-hotel-udaipur/rooms-and-suites',
'/en_us/hotels-in-udaipur/the-leela-palace-hotel-udaipur/offers',
'/en_us/hotels-in-udaipur/the-leela-palace-hotel-udaipur/meetings',
'/en_us/hotels-in-udaipur/the-leela-palace-hotel-udaipur/celebrations',
'/en_us/hotels-in-udaipur/the-leela-palace-hotel-udaipur/dining',
'/en_us/hotels-in-udaipur/the-leela-palace-hotel-udaipur/spa',
'javascript:facebookLogin();',
'javascript:forgot_password(this);',
'/application/spring/myprofile/my-profile-edit',
'/en_us',
'/application/spring/myprofile/login',
'/the-leela/best-rates-guaranteed',
'#',
'javascript:facebookLogin();',
'/application/spring/myprofile/my-profile-edit',
'/en_us',
'/en_us/hotels-in-mumbai/the-leela-mumbai-hotel',
'/en_us/hotels-in-mumbai/the-leela-mumbai-hotel/spa',
'/en_us/hotels-in-mumbai/the-leela-mumbai-hotel/spa/signature-spa-treatments-',
'/en_us/hotels-in-mumbai/the-leela-mumbai-hotel/spa/holistic-treatments-',
'/en_us/hotels-in-mumbai/the-leela-mumbai-hotel/spa/fitness',
'/en_us/hotels-in-mumbai/the-leela-mumbai-hotel/spa/wellness',
'/en_us/hotels-in-mumbai/the-leela-mumbai-hotel/spa/salon',
'/en_us/hotels-in-mumbai/the-leela-mumbai-hotel/spa/signature-spa-treatments-',
'/en_us/hotels-in-mumbai/the-leela-mumbai-hotel/spa/holistic-treatments-',
'/en_us/hotels-in-mumbai/the-leela-mumbai-hotel/spa/fitness',
'/en_us/hotels-in-mumbai/the-leela-mumbai-hotel/spa/wellness',
'/en_us/hotels-in-mumbai/the-leela-mumbai-hotel/spa/salon',
'/contentAsset/raw-data/d1e3f704-be84-4353-a95e-28629651db00/fileAsset',
'/the-leela/about-the-leela/history',
'/the-leela/about-the-leela/company-information',
'/the-leela/about-the-leela/alliances',
'/the-leela/about-the-leela/investor-relations',
'/the-leela/about-the-leela/future-openings',
'javascript:void(0);',
'/the-leela/media/media-coverage',
'/the-leela/media/press-releases',
'/the-leela/media/media-contacts',
'/the-leela/media/the-leela-magazine',
'/the-leela/media/awards',
'/the-leela/Loyalty/the-leela-discovery',
'/the-leela/Loyalty/leela-solitaire-line',
'/the-leela/Loyalty/connoisseur-club',
'/the-leela/Loyalty/the-leela-preferred-partners-membership-program',
'/the-leela/careers/opportunities',
'/the-leela/contact-us/hotels',
'/the-leela/contact-us/convention-centre',
'/the-leela/contact-us/reservations',
'/the-leela/contact-us/sales-marketing-offices',
'javascript:void(0);',
'/the-leela/others/art',
'/the-leela/others/boutique',
'/the-leela/termsConditions/legal',
'/the-leela/termsConditions/siteMap',
'/the-leela/termsConditions/privacy-policy',
'/the-leela/termsConditions/general-terms-and-conditions']
As you can see, there are still some irrelevant links here:
'javascript:void(0);',
'/application/spring/myprofile/login',
'/the-leela/best-rates-guaranteed',
'#',
'javascript:facebookLogin();',
'/application/spring/myprofile/my-profile-edit',
'/en_us',
I need help getting rid of these so that I can run the scraper in a loop over the output list. I appreciate any help.
Upvotes: 2
Views: 206
Reputation: 10503
I doubt there's a ready-made solution that is not site-specific. Based on my experience with crawlers, a few things come to mind:
robots.txt could also be useful.
mimetypes lib and/or even utilize Content-Type header
Upvotes: 1
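To make the suggestions above concrete, here is a minimal sketch using only the standard library. It is one possible approach, not a general solution: the names `clean_links`, `is_html`, `allowed_by_robots` and `SKIP_PREFIXES` are made up for illustration, the `/application/` prefix is a site-specific guess based on the login and profile links in the question, and the robots.txt rules are passed in as lines rather than taken from the real site.

```python
import mimetypes
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.robotparser import RobotFileParser

BASE_URL = "https://www.theleela.com/en_us/hotels-in-mumbai/the-leela-mumbai-hotel/spa/"

# Assumption: site-specific path prefixes to skip (login/profile pages).
SKIP_PREFIXES = ("/application/",)


def clean_links(hrefs, base_url=BASE_URL, skip_prefixes=SKIP_PREFIXES):
    """Return absolute, de-duplicated, same-site URLs in first-seen order."""
    base_host = urlparse(base_url).netloc
    current_page = urldefrag(base_url)[0]
    seen, result = set(), []
    for href in hrefs:
        if not isinstance(href, str):
            continue  # <a> tags without an href give None
        # Resolve relative links against the page URL and drop "#fragment".
        url = urldefrag(urljoin(base_url, href.strip()))[0]
        parts = urlparse(url)
        if parts.scheme not in ("http", "https"):
            continue  # drops javascript:, mailto:, tel:, ...
        if parts.netloc != base_host:
            continue  # stay on the same site
        if parts.path.startswith(skip_prefixes):
            continue
        # Skip links whose extension suggests a non-HTML file (.pdf, .jpg, ...).
        guessed_type, _encoding = mimetypes.guess_type(parts.path)
        if guessed_type is not None and guessed_type != "text/html":
            continue
        if url == current_page or url in seen:
            continue  # a bare "#" resolves to the current page
        seen.add(url)
        result.append(url)
    return result


def is_html(content_type):
    """Check a Content-Type header value such as 'text/html; charset=UTF-8'."""
    return content_type.split(";")[0].strip().lower() == "text/html"


def allowed_by_robots(url, robots_lines, user_agent="*"):
    """Apply robots.txt rules (given as a list of lines) to a URL."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(user_agent, url)
```

You would call `clean_links(links)` on the list from the question and loop over the result. Unlike the `"https:" not in a` test, this keeps absolute links that point at the same host. Links without a file extension (such as the `/contentAsset/.../fileAsset` one) cannot be judged by `mimetypes`, so for those you could pass the `Content-Type` header of the response to `is_html` before parsing. In a real crawler you would load the site's actual robots.txt with `RobotFileParser.set_url()` and `read()` instead of supplying the lines by hand.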