user3115933

Reputation: 4463

Using rvest to scrape specific values from a web page

I am using R with RStudio. I am trying to scrape data from a specific webpage using the rvest package. Below is a partial screenshot of the webpage, with the values I want to scrape circled in red.

screenshot

I am completely new to HTML and its elements, and I am having a hard time figuring out how to use the relevant HTML tags in rvest. Using Chrome DevTools, I have been able to find where each of the items I need is located in the HTML.

I am providing the tags relevant to each item below:

Table Headers:

<thead style="width: 547px; top: 0px; z-index: auto;" class="">
<tr class="hprt-table-header">
<th class="hprt-table-header-cell -first" style="width: 134px;">
Accommodation Type
</th>
<th class="hprt-table-header-cell hprt-table-header-price" style="width: 89px;">
Today's Price
</th>
<th class="hprt-table-header-cell hprt-table-header-policies" style="width: 146px;">
Your Choices</th>

Standard Queen Room:

<a class="hprt-roomtype-link" href="#RD27576901" data-room-id="27576901" id="room_type_id_27576901" data-room-name="" data-et-click="">
<span class="hprt-roomtype-icon-link ">
Standard Queen Room
</span>

MUR 13,097:

<div class="bui-price-display__value prco-inline-block-maker-helper prco-f-font-heading " aria-hidden="true" data-et-mouseenter="
customGoal:cCcCcCDUfcXIFbcDcbNXGDJae:2
goal:desktop_room_list_price_column_hover_over_price
">
MUR&nbsp;13,097
</div>

All-Inclusive:

" id="b_tt_holder_5" aria-describedby="materialized_tooltip_1n6pi">
<span class="bicon-allinclusive mp-icon meal-plan-icon"></span>
<span class="ungreen_keep_green">
All-Inclusive
</span>

Superior Queen Room:

<a class="hprt-roomtype-link" href="#RD27576902" data-room-id="27576902" id="room_type_id_27576902" data-room-name="" data-et-click="">
<span class="hprt-roomtype-icon-link ">
Superior Queen Room
</span>

MUR 14,266:

<div class="bui-price-display__value prco-inline-block-maker-helper prco-f-font-heading " aria-hidden="true" data-et-mouseenter="
customGoal:cCcCcCDUfcXIFbcDcbNXGDJae:2
goal:desktop_room_list_price_column_hover_over_price
">
MUR&nbsp;14,266
</div>

All-Inclusive:

" id="b_tt_holder_9" aria-describedby="materialized_tooltip_n2p5s">
<span class="bicon-allinclusive mp-icon meal-plan-icon"></span>
<span class="ungreen_keep_green">
All-Inclusive
</span>

I would like to transform the output into a data frame as follows:

 Accommodation Type      Today's Price  Your Choices
 Standard Queen Room      MUR 13,097    All-Inclusive
 Superior Queen Room      MUR 14,266    All-Inclusive 

My R code currently stands as follows:

if (!require(rvest)) install.packages('rvest')

library(rvest)

url1 <- read_html("https://www.booking.com/hotel/mu/tamassa.html?aid=356980;label=gog235jc-1DCAsonQFCE2hlcml0YWdlLWF3YWxpLWdvbGZIM1gDaJ0BiAEBmAExuAEXyAEM2AED6AEB-AECiAIBqAIDuAKiwqmEBsACAdICJGFkMTQ3OGU4LTUwZDMtNGQ5ZS1hYzAxLTc0OTIyYTRiZDIxM9gCBOACAQ;sid=729aafddc363c28a2c2c7379d7685d87;all_sr_blocks=36363601_246990918_2_85_0;checkin=2021-09-04;checkout=2021-09-05;dest_id=-1354779;dest_type=city;dist=0;from_beach_key_ufi_sr=1;group_adults=2;group_children=0;hapos=1;highlighted_blocks=36363601_246990918_2_85_0;hp_group_set=0;hpos=1;no_rooms=1;room1=A%2CA;sb_price_type=total;sr_order=popularity;sr_pri_blocks=36363601_246990918_2_85_0__29200;srepoch=1619681695;srpvid=51c8354f03be0097;type=total;ucfs=1&")     

Any help would be highly appreciated.

Upvotes: 1

Views: 874

Answers (2)

Dave2e

Reputation: 24149

Here is a solution that retrieves the table of prices and then performs some data cleaning. It still requires some additional clean-up, but the majority is done:

library(rvest)
library(dplyr)
library(stringr)

url1 <- read_html("https://www.booking.com/hotel/mu/tamassa.html?aid=356980;label=gog235jc-1DCAsonQFCE2hlcml0YWdlLWF3YWxpLWdvbGZIM1gDaJ0BiAEBmAExuAEXyAEM2AED6AEB-AECiAIBqAIDuAKiwqmEBsACAdICJGFkMTQ3OGU4LTUwZDMtNGQ5ZS1hYzAxLTc0OTIyYTRiZDIxM9gCBOACAQ;sid=729aafddc363c28a2c2c7379d7685d87;all_sr_blocks=36363601_246990918_2_85_0;checkin=2021-09-04;checkout=2021-09-05;dest_id=-1354779;dest_type=city;dist=0;from_beach_key_ufi_sr=1;group_adults=2;group_children=0;hapos=1;highlighted_blocks=36363601_246990918_2_85_0;hp_group_set=0;hpos=1;no_rooms=1;room1=A%2CA;sb_price_type=total;sr_order=popularity;sr_pri_blocks=36363601_246990918_2_85_0__29200;srepoch=1619681695;srpvid=51c8354f03be0097;type=total;ucfs=1&") 

output <- url1 %>% 
   html_nodes(xpath = './/table[@id="hprt-table"]')  %>% 
   html_table() %>% .[[1]]

    
# Fix the name of the fifth column
colnames(output)[5] <- "Quantity"

# Clean up columns:
# keep only the first line of these two columns, dropping the repeated information
output2 <- output %>% mutate_at(c("Accommodation Type", "Today's price"), ~str_extract(., ".*\n"))
# Collapse repeated whitespace and newlines in every column
answer <- output2 %>% mutate_all(str_squish)

answer
# A tibble: 8 x 5
  `Accommodation Ty… Sleeps           `Today's price` `Your choices`                                                                   Quantity                                                 
  <chr>              <chr>            <chr>           <chr>                                                                            <chr>                                                    
1 Triple Room        Max persons: 3   US$398          All-Inclusive FREE cancellation before 23:59 on 27 August 2021 More details on … Select rooms 0 1 (US$398) 2 (US$795) 3 (US$1,193) 4 (US$…
2 Triple Room        Max persons: 1 … US$313          All-Inclusive FREE cancellation before 23:59 on 27 August 2021 More details on … Select rooms 0 1 (US$313) 2 (US$626) 3 (US$939) 4 (US$1,…
3 Standard Queen Ro… Max persons: 2   US$325          All-Inclusive FREE cancellation before 23:59 on 27 August 2021 More details on … Select rooms 0 1 (US$325) 2 (US$650) 3 (US$976) 4 (US$1,…
4 Standard Queen Ro… Max persons: 1 … US$241          All-Inclusive FREE cancellation before 23:59 on 27 August 2021 More details on … Select rooms 0 1 (US$241) 2 (US$481) 3 (US$722) 4 (US$96…
5 Superior Queen Ro… Max persons: 2   US$354          All-Inclusive FREE cancellation before 23:59 on 27 August 2021 More details on … Select rooms 0 1 (US$354) 2 (US$708) 3 (US$1,063) 4 (US$…
6 Superior Queen Ro… Max persons: 1 … US$270          All-Inclusive FREE cancellation before 23:59 on 27 August 2021 More details on … Select rooms 0 1 (US$270) 2 (US$539) 3 (US$809) 4 (US$1,…
7 Deluxe Family Room Max persons: 2   US$532          All-Inclusive FREE cancellation before 23:59 on 27 August 2021 More details on … Select rooms 0 1 (US$532) 2 (US$1,064) 3 (US$1,596) 4 (U…
8 Deluxe Family Room Max persons: 1 … US$447          All-Inclusive FREE cancellation before 23:59 on 27 August 2021 More details on … Select rooms 0 1 (US$447) 2 (US$895) 3 (US$1,342) 4 (US$…
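One possible piece of the remaining clean-up is turning the price strings into numbers. The following is only a sketch in base R: the sample vector is hand-copied from the output above and from the question (where the price contains a non-breaking space), and the names `price`, `currency` and `amount` are made up for illustration.

```r
# Sample price strings; "\u00a0" is the non-breaking space seen as &nbsp; in the question
price <- c("US$398", "US$1,193", "MUR\u00a013,097")

# Currency label: everything before the first character that is not a letter or "$"
currency <- sub("[^A-Za-z$].*$", "", price)

# Numeric amount: drop everything except digits and the decimal point
amount <- as.numeric(gsub("[^0-9.]", "", price))

data.frame(price, currency, amount)
```

This assumes a comma is always a thousands separator and a period is the decimal mark, which may not hold for every locale the site serves.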

Upvotes: 0

Till

Reputation: 6663

This is not a complete solution, as this is a rather complex task.

In general, you can select HTML tags/nodes with html_nodes() by passing a CSS selector built from their class or id attribute. In your case the classes are the more useful handle, since the ids shown are specific to individual rooms. An id is prefixed with # and a class with ., e.g. ".hprt-table-header" (as used in the code below). The code for extracting the text is very similar for each chunk of information you are after; just modify the code below accordingly. A harder issue is working out which rows have more than one value for "prices" and "choices".
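To try the two selector forms without hitting the live site, here is a minimal sketch that parses a small HTML string. The fragment is a hypothetical, cut-down version of the markup quoted in the question, not the real page.

```r
library(rvest)

# Hypothetical fragment modelled on the tags quoted in the question
page <- read_html('
<table id="hprt-table">
  <thead>
    <tr class="hprt-table-header">
      <th class="hprt-table-header-cell">Accommodation Type</th>
      <th class="hprt-table-header-cell">Your Choices</th>
    </tr>
  </thead>
  <tbody>
    <tr><td>
      <a class="hprt-roomtype-link" id="room_type_id_27576901">
        <span class="hprt-roomtype-icon-link">Standard Queen Room</span>
      </a>
    </td></tr>
  </tbody>
</table>')

# Class selector: prefix the class name with "."
headers <- page %>%
  html_nodes(".hprt-table-header-cell") %>%
  html_text(trim = TRUE)

# Id selector: prefix the id with "#"
room <- page %>%
  html_nodes("#room_type_id_27576901") %>%
  html_text(trim = TRUE)

headers
room
```

A class selector can match many elements at once, whereas an id is meant to be unique on the page, which is why classes are usually the better choice when you want every room or every price.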

library(rvest)
#> Loading required package: xml2

url1 <- read_html("https://www.booking.com/hotel/mu/tamassa.html?aid=356980;label=gog235jc-1DCAsonQFCE2hlcml0YWdlLWF3YWxpLWdvbGZIM1gDaJ0BiAEBmAExuAEXyAEM2AED6AEB-AECiAIBqAIDuAKiwqmEBsACAdICJGFkMTQ3OGU4LTUwZDMtNGQ5ZS1hYzAxLTc0OTIyYTRiZDIxM9gCBOACAQ;sid=729aafddc363c28a2c2c7379d7685d87;all_sr_blocks=36363601_246990918_2_85_0;checkin=2021-09-04;checkout=2021-09-05;dest_id=-1354779;dest_type=city;dist=0;from_beach_key_ufi_sr=1;group_adults=2;group_children=0;hapos=1;highlighted_blocks=36363601_246990918_2_85_0;hp_group_set=0;hpos=1;no_rooms=1;room1=A%2CA;sb_price_type=total;sr_order=popularity;sr_pri_blocks=36363601_246990918_2_85_0__29200;srepoch=1619681695;srpvid=51c8354f03be0097;type=total;ucfs=1&")     

Table Headers

url1 %>% 
  html_nodes(".hprt-table-header") %>% 
  html_text() %>% 
  strsplit("\n") %>% 
  unlist() %>% 
  .[. != ""] %>% 
  gsub("\n", "", .) %>% 
  .[-5]
#> [1] "Accommodation Type" "Sleeps"             "Today's price"     
#> [4] "Your choices"       "Quantity"

Room Type

url1 %>% 
  html_nodes(".hprt-roomtype-icon-link") %>% 
  html_text() %>% 
  strsplit("\n") %>% 
  unlist() %>% 
  .[. != ""]
#> [1] "Standard Queen Room" "Superior Queen Room" "Deluxe Family Room" 
#> [4] "Triple Room"

Price

url1 %>% 
  html_nodes(".bui-price-display__value") %>% 
  html_text() %>% 
  strsplit("\n") %>% 
  unlist() %>% 
  .[. != ""] %>% 
  gsub("\n", "", .) 
#> [1] "US$325" "US$241" "US$354" "US$270" "US$532" "US$447" "US$398" "US$313"
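The three vectors above have different lengths (four room types, eight prices), so they cannot simply be bound into a data frame. One possible approach is to work row by row with html_node(), which returns one result per row and NA where nothing matches, and then fill the room name downwards. The sketch below runs on a hypothetical, simplified fragment; it assumes the room-type cell spans several rows and that the first row always has a room name, which I have not verified against the live page.

```r
library(rvest)

# Hypothetical markup: each room-type cell spans two price rows
page <- read_html('
<table id="hprt-table"><tbody>
  <tr>
    <td rowspan="2"><span class="hprt-roomtype-icon-link">Standard Queen Room</span></td>
    <td><div class="bui-price-display__value">US$325</div></td>
    <td><span class="ungreen_keep_green">All-Inclusive</span></td>
  </tr>
  <tr>
    <td><div class="bui-price-display__value">US$241</div></td>
    <td><span class="ungreen_keep_green">All-Inclusive</span></td>
  </tr>
  <tr>
    <td rowspan="2"><span class="hprt-roomtype-icon-link">Superior Queen Room</span></td>
    <td><div class="bui-price-display__value">US$354</div></td>
    <td><span class="ungreen_keep_green">All-Inclusive</span></td>
  </tr>
  <tr>
    <td><div class="bui-price-display__value">US$270</div></td>
    <td><span class="ungreen_keep_green">All-Inclusive</span></td>
  </tr>
</tbody></table>')

rows <- html_nodes(page, "#hprt-table tr")

# html_node() (singular) yields exactly one entry per row, NA when absent
room   <- rows %>% html_node(".hprt-roomtype-icon-link") %>% html_text(trim = TRUE)
price  <- rows %>% html_node(".bui-price-display__value") %>% html_text(trim = TRUE)
choice <- rows %>% html_node(".ungreen_keep_green") %>% html_text(trim = TRUE)

# Carry the last non-missing room name down into the rows that lack one
room_filled <- room[cummax(seq_along(room) * !is.na(room))]

result <- data.frame(
  `Accommodation Type` = room_filled,
  `Today's Price`      = price,
  `Your Choices`       = choice,
  check.names = FALSE,
  stringsAsFactors = FALSE
)
result
```

Because every column comes from the same set of rows, the columns stay aligned even when a room has several price options.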

Note that before scraping large amounts of data from a website, you should confirm that you are not putting yourself in legal jeopardy.

Upvotes: 1
