Reputation: 4463
I am using R with RStudio.
I am trying to scrape data from a specific webpage using the rvest package. Below is a partial screenshot of the webpage, with the values I am interested in scraping circled in red.
I am completely new to HTML and its elements, and I am having a hard time figuring out how to use the relevant HTML tags in rvest. Using Chrome DevTools, I have been able to work out where each of the items I need is located in the HTML code.
I am providing the tags relevant to each item below:
Table Headers:
<thead style="width: 547px; top: 0px; z-index: auto;" class="">
<tr class="hprt-table-header">
<th class="hprt-table-header-cell -first" style="width: 134px;">
Accommodation Type
</th>
<th class="hprt-table-header-cell hprt-table-header-price" style="width: 89px;">
Today's Price
</th>
<th class="hprt-table-header-cell hprt-table-header-policies" style="width: 146px;">
Your Choices</th>
Standard Queen Room:
<a class="hprt-roomtype-link" href="#RD27576901" data-room-id="27576901" id="room_type_id_27576901" data-room-name="" data-et-click="">
<span class="hprt-roomtype-icon-link ">
Standard Queen Room
</span>
MUR 13,097:
<div class="bui-price-display__value prco-inline-block-maker-helper prco-f-font-heading " aria-hidden="true" data-et-mouseenter="
customGoal:cCcCcCDUfcXIFbcDcbNXGDJae:2
goal:desktop_room_list_price_column_hover_over_price
">
MUR 13,097
</div>
All-Inclusive:
" id="b_tt_holder_5" aria-describedby="materialized_tooltip_1n6pi">
<span class="bicon-allinclusive mp-icon meal-plan-icon"></span>
<span class="ungreen_keep_green">
All-Inclusive
</span>
Superior Queen Room:
<a class="hprt-roomtype-link" href="#RD27576902" data-room-id="27576902" id="room_type_id_27576902" data-room-name="" data-et-click="">
<span class="hprt-roomtype-icon-link ">
Superior Queen Room
</span>
MUR 14,266:
<div class="bui-price-display__value prco-inline-block-maker-helper prco-f-font-heading " aria-hidden="true" data-et-mouseenter="
customGoal:cCcCcCDUfcXIFbcDcbNXGDJae:2
goal:desktop_room_list_price_column_hover_over_price
">
MUR 14,266
</div>
All-Inclusive:
" id="b_tt_holder_9" aria-describedby="materialized_tooltip_n2p5s">
<span class="bicon-allinclusive mp-icon meal-plan-icon"></span>
<span class="ungreen_keep_green">
All-Inclusive
</span>
I would like to transform the output into a data frame as follows:
Accommodation Type     Today's Price   Your Choices
Standard Queen Room    MUR 13,097      All-Inclusive
Superior Queen Room    MUR 14,266      All-Inclusive
My R code currently stands as follows:
if (!require(rvest)) install.packages('rvest')
library(rvest)
url1 <- read_html("https://www.booking.com/hotel/mu/tamassa.html?aid=356980;label=gog235jc-1DCAsonQFCE2hlcml0YWdlLWF3YWxpLWdvbGZIM1gDaJ0BiAEBmAExuAEXyAEM2AED6AEB-AECiAIBqAIDuAKiwqmEBsACAdICJGFkMTQ3OGU4LTUwZDMtNGQ5ZS1hYzAxLTc0OTIyYTRiZDIxM9gCBOACAQ;sid=729aafddc363c28a2c2c7379d7685d87;all_sr_blocks=36363601_246990918_2_85_0;checkin=2021-09-04;checkout=2021-09-05;dest_id=-1354779;dest_type=city;dist=0;from_beach_key_ufi_sr=1;group_adults=2;group_children=0;hapos=1;highlighted_blocks=36363601_246990918_2_85_0;hp_group_set=0;hpos=1;no_rooms=1;room1=A%2CA;sb_price_type=total;sr_order=popularity;sr_pri_blocks=36363601_246990918_2_85_0__29200;srepoch=1619681695;srpvid=51c8354f03be0097;type=total;ucfs=1&")
Any help would be highly appreciated.
Upvotes: 1
Views: 874
Reputation: 24149
Here is a solution that retrieves the table of prices and then performs some data cleaning.
It still requires some additional clean-up, but the majority is done:
library(rvest)
library(dplyr)
library(stringr)
url1 <- read_html("https://www.booking.com/hotel/mu/tamassa.html?aid=356980;label=gog235jc-1DCAsonQFCE2hlcml0YWdlLWF3YWxpLWdvbGZIM1gDaJ0BiAEBmAExuAEXyAEM2AED6AEB-AECiAIBqAIDuAKiwqmEBsACAdICJGFkMTQ3OGU4LTUwZDMtNGQ5ZS1hYzAxLTc0OTIyYTRiZDIxM9gCBOACAQ;sid=729aafddc363c28a2c2c7379d7685d87;all_sr_blocks=36363601_246990918_2_85_0;checkin=2021-09-04;checkout=2021-09-05;dest_id=-1354779;dest_type=city;dist=0;from_beach_key_ufi_sr=1;group_adults=2;group_children=0;hapos=1;highlighted_blocks=36363601_246990918_2_85_0;hp_group_set=0;hpos=1;no_rooms=1;room1=A%2CA;sb_price_type=total;sr_order=popularity;sr_pri_blocks=36363601_246990918_2_85_0__29200;srepoch=1619681695;srpvid=51c8354f03be0097;type=total;ucfs=1&")
output <- url1 %>%
  html_nodes(xpath = './/table[@id="hprt-table"]') %>%
  html_table() %>%
  .[[1]]
# Fix column name
colnames(output)[5] <- "Quantity"
# Clean up columns
# Remove repeating information in 2 columns (keep only the first line of each cell)
output2 <- output %>%
  mutate_at(c("Accommodation Type", "Today's price"), ~ str_extract(., ".*\n"))
# Remove repeating newlines
answer <- output2 %>% mutate_all(str_squish)
answer
# A tibble: 8 x 5
`Accommodation Ty… Sleeps `Today's price` `Your choices` Quantity
<chr> <chr> <chr> <chr> <chr>
1 Triple Room Max persons: 3 US$398 All-Inclusive FREE cancellation before 23:59 on 27 August 2021 More details on … Select rooms 0 1 (US$398) 2 (US$795) 3 (US$1,193) 4 (US$…
2 Triple Room Max persons: 1 … US$313 All-Inclusive FREE cancellation before 23:59 on 27 August 2021 More details on … Select rooms 0 1 (US$313) 2 (US$626) 3 (US$939) 4 (US$1,…
3 Standard Queen Ro… Max persons: 2 US$325 All-Inclusive FREE cancellation before 23:59 on 27 August 2021 More details on … Select rooms 0 1 (US$325) 2 (US$650) 3 (US$976) 4 (US$1,…
4 Standard Queen Ro… Max persons: 1 … US$241 All-Inclusive FREE cancellation before 23:59 on 27 August 2021 More details on … Select rooms 0 1 (US$241) 2 (US$481) 3 (US$722) 4 (US$96…
5 Superior Queen Ro… Max persons: 2 US$354 All-Inclusive FREE cancellation before 23:59 on 27 August 2021 More details on … Select rooms 0 1 (US$354) 2 (US$708) 3 (US$1,063) 4 (US$…
6 Superior Queen Ro… Max persons: 1 … US$270 All-Inclusive FREE cancellation before 23:59 on 27 August 2021 More details on … Select rooms 0 1 (US$270) 2 (US$539) 3 (US$809) 4 (US$1,…
7 Deluxe Family Room Max persons: 2 US$532 All-Inclusive FREE cancellation before 23:59 on 27 August 2021 More details on … Select rooms 0 1 (US$532) 2 (US$1,064) 3 (US$1,596) 4 (U…
8 Deluxe Family Room Max persons: 1 … US$447 All-Inclusive FREE cancellation before 23:59 on 27 August 2021 More details on … Select rooms 0 1 (US$447) 2 (US$895) 3 (US$1,342) 4 (US$…
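As one possible way to finish the remaining clean-up, the sketch below trims the result down to the three columns asked for in the question. It does not hit the website: `answer` here is a hand-typed stand-in for two rows of the real result, and the rule "the meal plan is whatever precedes FREE cancellation" is an assumption based on the output shown above, not something the page guarantees.

```r
library(dplyr)
library(stringr)

# Hand-typed stand-in for two rows of `answer`; the real values come from the scrape
answer <- tibble(
  `Accommodation Type` = c("Standard Queen Room", "Superior Queen Room"),
  Sleeps = c("Max persons: 2", "Max persons: 2"),
  `Today's price` = c("US$325", "US$354"),
  `Your choices` = c(
    "All-Inclusive FREE cancellation before 23:59 on 27 August 2021",
    "All-Inclusive FREE cancellation before 23:59 on 27 August 2021"
  ),
  Quantity = c("Select rooms 0 1 (US$325)", "Select rooms 0 1 (US$354)")
)

tidy <- answer %>%
  transmute(
    `Accommodation Type`,
    `Today's Price` = `Today's price`,
    # Assumption: everything from "FREE cancellation" onwards is policy text
    `Your Choices` = str_squish(str_remove(`Your choices`, "FREE cancellation.*"))
  )

tidy
```

If a row has no "FREE cancellation" text, `str_remove()` leaves the cell unchanged, so such rows would still need a separate rule.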
Upvotes: 0
Reputation: 6663
This is not a complete solution, as this is a rather complex task.
In general, you can select HTML tags/nodes with html_nodes() by specifying their class or id attribute as a CSS selector. In your case I see no ids, but there are classes. Ids are prefixed with #, classes with ., e.g. ".hprt-table-header" (as used in the code below). The code for extracting the text is very similar for each chunk of information you are after; just modify the code below for those. A harder issue may be handling the rows that have more than one value for the "prices" and "choices".
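To make the difference between the two selector prefixes concrete, here is a minimal sketch that parses a small inline fragment instead of the live page. The fragment is hand-written and simplified from the tags quoted in the question, so it is an assumption about the page structure rather than a copy of it.

```r
library(rvest)

# Hypothetical, simplified fragment modelled on the tags quoted in the question
page <- read_html('
<table id="hprt-table">
  <tr class="hprt-table-header"><th>Accommodation Type</th></tr>
  <tr><td><a id="room_type_id_27576901" class="hprt-roomtype-link"><span class="hprt-roomtype-icon-link">Standard Queen Room</span></a></td></tr>
</table>')

# Class selector: prefix the class name with "."
by_class <- page %>%
  html_nodes(".hprt-roomtype-icon-link") %>%
  html_text(trim = TRUE)

# Id selector: prefix the id with "#"
by_id <- page %>%
  html_nodes("#room_type_id_27576901") %>%
  html_text(trim = TRUE)

by_class
by_id
```

Both calls reach the same text here; a class selector usually matches many nodes (one per room), while an id is meant to match a single node.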
library(rvest)
#> Loading required package: xml2
url1 <- read_html("https://www.booking.com/hotel/mu/tamassa.html?aid=356980;label=gog235jc-1DCAsonQFCE2hlcml0YWdlLWF3YWxpLWdvbGZIM1gDaJ0BiAEBmAExuAEXyAEM2AED6AEB-AECiAIBqAIDuAKiwqmEBsACAdICJGFkMTQ3OGU4LTUwZDMtNGQ5ZS1hYzAxLTc0OTIyYTRiZDIxM9gCBOACAQ;sid=729aafddc363c28a2c2c7379d7685d87;all_sr_blocks=36363601_246990918_2_85_0;checkin=2021-09-04;checkout=2021-09-05;dest_id=-1354779;dest_type=city;dist=0;from_beach_key_ufi_sr=1;group_adults=2;group_children=0;hapos=1;highlighted_blocks=36363601_246990918_2_85_0;hp_group_set=0;hpos=1;no_rooms=1;room1=A%2CA;sb_price_type=total;sr_order=popularity;sr_pri_blocks=36363601_246990918_2_85_0__29200;srepoch=1619681695;srpvid=51c8354f03be0097;type=total;ucfs=1&")
url1 %>%
  html_nodes(".hprt-table-header") %>%
  html_text() %>%
  strsplit("\n") %>%
  unlist() %>%
  .[. != ""] %>%
  gsub("\n", "", .) %>%
  .[-5]
#> [1] "Accommodation Type" "Sleeps" "Today's price"
#> [4] "Your choices" "Quantity"
url1 %>%
  html_nodes(".hprt-roomtype-icon-link") %>%
  html_text() %>%
  strsplit("\n") %>%
  unlist() %>%
  .[. != ""]
#> [1] "Standard Queen Room" "Superior Queen Room" "Deluxe Family Room"
#> [4] "Triple Room"
url1 %>%
  html_nodes(".bui-price-display__value") %>%
  html_text() %>%
  strsplit("\n") %>%
  unlist() %>%
  .[. != ""] %>%
  gsub("\n", "", .)
#> [1] "US$325" "US$241" "US$354" "US$270" "US$532" "US$447" "US$398" "US$313"
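The vectors above have different lengths (four room names, eight prices), which is the "more than one value per row" problem. One way to line them up, sketched here under assumptions, is to iterate over table rows and use html_node() (singular), which returns exactly one result per row and NA where nothing matches. The inline HTML is hand-written: it assumes the room name cell spans several price rows, and the class "ungreen_keep_green" for the meal plan is taken from the question. `fill_down` is a hypothetical helper, not part of rvest.

```r
library(rvest)

# Hand-written fragment; assumes one name cell spans several price rows
html <- '
<table id="hprt-table"><tbody>
  <tr>
    <td rowspan="2"><span class="hprt-roomtype-icon-link">Standard Queen Room</span></td>
    <td><div class="bui-price-display__value">US$325</div></td>
    <td><span class="ungreen_keep_green">All-Inclusive</span></td>
  </tr>
  <tr>
    <td><div class="bui-price-display__value">US$241</div></td>
    <td><span class="ungreen_keep_green">All-Inclusive</span></td>
  </tr>
  <tr>
    <td rowspan="2"><span class="hprt-roomtype-icon-link">Superior Queen Room</span></td>
    <td><div class="bui-price-display__value">US$354</div></td>
    <td><span class="ungreen_keep_green">All-Inclusive</span></td>
  </tr>
  <tr>
    <td><div class="bui-price-display__value">US$270</div></td>
    <td><span class="ungreen_keep_green">All-Inclusive</span></td>
  </tr>
</tbody></table>'

rows <- read_html(html) %>% html_nodes("#hprt-table tbody tr")

# One value per row; NA where the row has no matching node
room  <- rows %>% html_node(".hprt-roomtype-icon-link") %>% html_text(trim = TRUE)
price <- rows %>% html_node(".bui-price-display__value") %>% html_text(trim = TRUE)
plan  <- rows %>% html_node(".ungreen_keep_green") %>% html_text(trim = TRUE)

# Hypothetical helper: carry the last non-missing value forward
fill_down <- function(x) c(NA, x[!is.na(x)])[cumsum(!is.na(x)) + 1]

result <- data.frame(
  `Accommodation Type` = fill_down(room),
  `Today's Price` = price,
  `Your Choices` = plan,
  check.names = FALSE,
  stringsAsFactors = FALSE
)

result
```

On the live page the row selector may also pick up header or spacer rows, so the selector would likely need adjusting after inspecting the real markup.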
Note that before scraping large amounts of data from a website, you should confirm that you are not putting yourself in legal jeopardy.
Upvotes: 1