Reputation: 458
I want to scrape hotel prices from booking.com, but I can't figure out why an empty list is returned when I search for the class using beautifulsoup4. My code is given here.
import webbrowser, requests
from bs4 import BeautifulSoup
res = requests.get("http://www.booking.com/searchresults.html?label=gen173nr-1FCAEoggJCAlhYSDNiBW5vcmVmaGyIAQGYATG4AQjIAQzYAQHoAQH4AQKoAgM&sid=c24fad210186ae699e89a0d3cab10039&dcid=4&checkin_monthday=18&checkin_year_month=2016-6&checkout_monthday=19&checkout_year_month=2016-6&class_interval=1&dest_id=-2092511&dest_type=city&group_adults=2&group_children=0&hlrd=0&label_click=undef&nflt=ht_id%3D204%3B&no_rooms=1&review_score_group=empty&room1=A%2CA&sb_price_type=total&sb_travel_purpose=business&score_min=0&src_elem=sb&ss=Kolkata%2C%20West%20Bengal%2C%20India&ss_raw=kolka&ssb=empty&order=score")
res.status_code
soup = BeautifulSoup(res.text,"lxml")
name = []
rating = []
hotel_name = soup.select('.sr-hotel__name')
hotel_price = soup.select('tr', class_='roomPrice')
hotel_rating = soup.select('.js--hp-scorecard-scoreval')
print hotel_price
for i in range(0, 10):
    name.append(hotel_name[i].contents[0])
    rating.append(hotel_rating[i].contents[0])
    #print name[i]
    #print rating[i]
Upvotes: 1
Views: 1844
Reputation: 180461
I had to do two things: 1. add a user-agent, and 2. change the selectors. The source you get when scraping is actually different from what you see when you right-click and pick "view source" in the browser:
In [7]: head = {"User-Agent":"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36"}
In [8]: url = "http://www.booking.com/searchresults.html?label=gen173nr-1FCAEoggJCAlhYSDNiBW5vcmVmaGyIAQGYATG4AQjIAQzYAQHoAQH4AQKoAgM&sid=c24fad210186ae699e89a0d3cab10039&dcid=4&checkin_monthday=18&checkin_year_month=2016-6&checkout_monthday=19&checkout_year_month=2016-6&class_interval=1&dest_id=-2092511&dest_type=city&group_adults=2&group_children=0&hlrd=0&label_click=undef&nflt=ht_id%3D204%3B&no_rooms=1&review_score_group=empty&room1=A%2CA&sb_price_type=total&sb_travel_purpose=business&score_min=0&src_elem=sb&ss=Kolkata%2C%20West%20Bengal%2C%20India&ss_raw=kolka&ssb=empty&order=score"
In [9]: res = requests.get(url, headers=head)
In [10]: soup = BeautifulSoup(res.text,"html.parser")
In [11]: hotels = soup.select("#hotellist_inner div.sr_item.sr_item_new")
In [12]: for hotel in hotels:
....:     name = hotel.select_one("span.sr-hotel__name").text.strip()
....:     print(name)
....:     score = hotel.select_one("span.average.js--hp-scorecard-scoreval")
....:     print(score.text.strip())
....:     price = hotel.select_one("table div.sr-prc--num.sr-prc--final")
....:     print(price.text.strip() if price else "Unavailable")
....:
The Oberoi Grand Kolkata
9.0
€ 113
Taj Bengal
9.0
€ 113
Sapphire Suites
7.4
Unavailable
The Gateway Hotel EM Bypass Kolkata
8.6
€ 84
The Lalit Great Eastern Kolkata
8.6
€ 101
Swissôtel Kolkata
8.5
€ 86
Kenilworth Hotel
8.5
€ 78
The Fern Residency Kolkata
8.4
€ 84
ITC Sonar Kolkata A Luxury Collection Hotel
8.3
€ 116
Hyatt Regency
8.3
€ 63
Treebo Platinum
8.2
€ 38
The Corner Courtyard
8.2
€ 73
Jameson Inn Shiraz
8.0
€ 58
The Sonnet
7.9
€ 80
Hotel Casa Fortuna
7.9
€ 56
Pipal Tree Hotel
7.9
€ 77
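The loop above can be tried offline against a small HTML fragment, which makes it easier to see why the `if price else "Unavailable"` guard matters. This is only a sketch: `SAMPLE_HTML`, `text_or_default` and `parse_hotels` are hypothetical names, and the fragment merely imitates the structure the selectors expect, so the real booking.com markup may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment that imitates the structure targeted by the selectors;
# it is not real booking.com markup.
SAMPLE_HTML = """
<div id="hotellist_inner">
  <div class="sr_item sr_item_new">
    <span class="sr-hotel__name"> Hotel A </span>
    <span class="average js--hp-scorecard-scoreval"> 9.0 </span>
    <table><tr><td>
      <div class="sr-prc--num sr-prc--final"> € 113 </div>
    </td></tr></table>
  </div>
  <div class="sr_item sr_item_new">
    <span class="sr-hotel__name"> Hotel B </span>
    <span class="average js--hp-scorecard-scoreval"> 7.4 </span>
  </div>
</div>
"""


def text_or_default(parent, selector, default="Unavailable"):
    """Return the stripped text of the first match, or a default if none."""
    node = parent.select_one(selector)  # select_one returns None on no match
    return node.get_text(strip=True) if node is not None else default


def parse_hotels(html):
    """Collect (name, score, price) tuples, one per hotel card."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for hotel in soup.select("#hotellist_inner div.sr_item.sr_item_new"):
        rows.append((
            text_or_default(hotel, "span.sr-hotel__name"),
            text_or_default(hotel, "span.average.js--hp-scorecard-scoreval"),
            text_or_default(hotel, "table div.sr-prc--num.sr-prc--final"),
        ))
    return rows


hotels = parse_hotels(SAMPLE_HTML)
for name, score, price in hotels:
    print(name, score, price)
```

The second card has no price element, so `select_one` returns `None` for it and the default is used instead of raising an `AttributeError`.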
Also, the syntax of your select call, soup.select('tr', class_='roomPrice'), is incorrect: select() takes a CSS selector string, whereas class_ is a keyword argument for find_all(). It should be soup.select('tr.roomPrice').
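As a quick illustration of the two spellings, here is a minimal sketch on a made-up table. The HTML is hypothetical and only exists to show that a CSS selector passed to `select` and the `class_` keyword passed to `find_all` pick out the same rows.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, only used to compare the two lookups.
html = """
<table>
  <tr class="roomPrice"><td>113</td></tr>
  <tr class="other"><td>n/a</td></tr>
  <tr class="roomPrice"><td>84</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# CSS selector form: tag name and class joined with a dot.
by_select = soup.select("tr.roomPrice")

# Keyword form: class_ belongs to find_all, not to select.
by_find_all = soup.find_all("tr", class_="roomPrice")

prices_select = [row.get_text(strip=True) for row in by_select]
prices_find_all = [row.get_text(strip=True) for row in by_find_all]
print(prices_select)
print(prices_find_all)
```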
But the output above, and indeed the page itself if you visit it, is not ordered by score. What we need to do is use the base URL and pass the query as params:
In [20]: params = {'checkin_year_month':'2016-6',
....: 'checkout_monthday':'19',
....: 'checkout_year_month':'2016-6',
....: 'class_interval':'1',
....: 'dest_id':'-2092511',
....: 'dest_type':'city',
....: 'dtdisc':'0',
....: 'group_adults':'2',
....: 'group_children':'0',
....: 'hlrd':'0',
....: 'hyb_red':'0',
....: 'inac':'0',
....: 'label_click':'undef',
....: 'nflt':'ht_id=204;',
....: 'nha_red':'0',
....: 'no_rooms':'1',
....: 'offset':'0',
....: 'order':'score',
....: 'postcard':'0',
....: 'redirected_from_city':'0',
....: 'redirected_from_landmark':'0',
....: 'redirected_from_region':'0',
....: 'review_score_group':'empty',
....: 'room1':'A,A',
....: 'sb_price_type':'total',
....: 'sb_travel_purpose':'business',
....: 'score_min':'0',
....: 'src_elem':'sb',
....: 'ss':'Kolkata, West Bengal, India',
....: 'ss_all':'0',
....: 'ss_raw':'kolka',
....: 'ssb':'empty',
....: 'sshis':'0'}
In [21]: head = {"User-Agent":"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36"}
In [22]: url = "http://www.booking.com/searchresults.html"
In [23]: res = requests.get(url, params=params, headers=head)
In [24]: soup = BeautifulSoup(res.text,"html.parser")
In [25]: hotels = soup.select("#hotellist_inner div.sr_item.sr_item_new")
In [26]: for hotel in hotels:
....:     name = hotel.select_one("span.sr-hotel__name").text.strip()
....:     print(name)
....:     score = hotel.select_one("span.average.js--hp-scorecard-scoreval")
....:     print(score.text.strip())
....:     price = hotel.select_one("table div.sr-prc--num.sr-prc--final")
....:     print(price.text.strip() if price else "Unavailable")
....:
The Oberoi Grand Kolkata
9.0
Unavailable
Taj Bengal
9.0
Unavailable
The Lalit Great Eastern Kolkata
8.6
Unavailable
The Gateway Hotel EM Bypass Kolkata
8.6
Unavailable
Swissôtel Kolkata
8.5
Unavailable
Kenilworth Hotel
8.5
Unavailable
The Fern Residency Kolkata
8.4
Unavailable
ITC Sonar Kolkata A Luxury Collection Hotel
8.3
Unavailable
Hyatt Regency
8.3
Unavailable
Treebo Platinum
8.2
Unavailable
The Corner Courtyard
8.2
Unavailable
Monovilla Inn
8.1
Unavailable
Jameson Inn Shiraz
8.0
Unavailable
The Sonnet
7.9
Unavailable
Hotel Casa Fortuna
7.9
Unavailable
That brings us here, where the prices are hidden, so we need to add a bit more logic. I will edit the answer in a bit.
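To see roughly what query string a params dictionary like the one above turns into, the encoding can be reproduced without any network access. This sketch uses the standard library's `urllib.parse.urlencode` as a stand-in rather than requests itself, and only a handful of the parameters; `build_url` is a hypothetical helper. requests is expected to produce an equivalent query string, though that is an assumption here and not something this snippet checks.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

BASE_URL = "http://www.booking.com/searchresults.html"

# A subset of the parameters from the session above, unencoded.
params = {
    "order": "score",
    "room1": "A,A",
    "nflt": "ht_id=204;",
    "ss": "Kolkata, West Bengal, India",
}


def build_url(base, query):
    """Hypothetical helper: append a percent-encoded query string to base."""
    return base + "?" + urlencode(query)


full_url = build_url(BASE_URL, params)
print(full_url)

# Decode the query again to confirm nothing was lost in the round trip.
decoded = {
    key: values[0]
    for key, values in parse_qs(urlsplit(full_url).query).items()
}
print(decoded == params)
```

Note that `urlencode` writes spaces as `+` where the original hand-built URL used `%20`; both decode to the same value.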
Upvotes: 2