Reputation: 383
I encounter a big problem in scrapping of HTML table with nested columns.
The table is from the immigration department of Hong Kong.
A screenshot is shown here:
I tried to do it with rvest, but the result is messy.
library(rvest)
library(tidyverse)
library(stringr)
library(dplyr)
url_data <- "https://www.immd.gov.hk/eng/stat_20220901.html"
url_data %>%
read_html()
css_selector <- "body > section:nth-child(7) > div > div > div > div > table"
immiTable <- url_data %>%
read_html() %>% html_element(css = css_selector) %>% html_table()
immiTable
My goal is to extract the first row (i.e. Airport) and plot it to a pie chart, and produce a dataframe of the whole table and save it to excel.
I realize that teaching material for unnest table and scrapping nested table is rather scarce. Therefore I need your guidance. Thank you very much for your help.
Upvotes: 3
Views: 319
Reputation: 76402
Here is a way. The headers format complicates things but the code below works. It extracts the entire table, not just the first row.
suppressPackageStartupMessages({
library(rvest)
library(dplyr)
library(ggplot2)
})
url_data <- "https://www.immd.gov.hk/eng/stat_20220901.html"
page <- url_data %>% read_html()
page %>%
html_elements("[headers='Arrival']") %>%
html_text() %>%
paste("Arrival", .) -> col_names
page %>%
html_elements("[headers='Departure']") %>%
html_text() %>%
paste("Departure", .) %>%
c(col_names, .) -> col_names
page %>%
html_elements("[headers='Control_Point']") %>%
html_text() -> row_names
page %>%
html_elements("[class='hRight']") %>%
html_text() %>%
sub(",", "", .) %>%
as.numeric() %>%
matrix(nrow = length(row_names), byrow = TRUE) %>%
as.data.frame() %>%
setNames(col_names) %>%
`row.names<-`(row_names) -> final
final
#> Arrival Hong Kong Residents
#> Airport 4258
#> Express Rail Link West Kowloon 0
#> Hung Hom 0
#> Lo Wu 0
#> Lok Ma Chau Spur Line 0
#> Heung Yuen Wai 0
#> Hong Kong-Zhuhai-Macao Bridge 333
#> Lok Ma Chau 0
#> Man Kam To 0
#> Sha Tau Kok 0
#> Shenzhen Bay 3404
#> China Ferry Terminal 0
#> Harbour Control 0
#> Kai Tak Cruise Terminal 0
#> Macau Ferry Terminal 0
#> Total 7995
#> Arrival Mainland Visitors Arrival Other Visitors
#> Airport 1488 422
#> Express Rail Link West Kowloon 0 0
#> Hung Hom 0 0
#> Lo Wu 0 0
#> Lok Ma Chau Spur Line 0 0
#> Heung Yuen Wai 0 0
#> Hong Kong-Zhuhai-Macao Bridge 28 39
#> Lok Ma Chau 0 0
#> Man Kam To 0 0
#> Sha Tau Kok 0 0
#> Shenzhen Bay 348 37
#> China Ferry Terminal 0 0
#> Harbour Control 0 0
#> Kai Tak Cruise Terminal 0 0
#> Macau Ferry Terminal 0 0
#> Total 1864 498
#> Arrival Total Departure Hong Kong Residents
#> Airport 6168 3775
#> Express Rail Link West Kowloon 0 0
#> Hung Hom 0 0
#> Lo Wu 0 0
#> Lok Ma Chau Spur Line 0 0
#> Heung Yuen Wai 0 0
#> Hong Kong-Zhuhai-Macao Bridge 400 243
#> Lok Ma Chau 0 0
#> Man Kam To 0 0
#> Sha Tau Kok 0 0
#> Shenzhen Bay 3789 1301
#> China Ferry Terminal 0 0
#> Harbour Control 0 0
#> Kai Tak Cruise Terminal 0 0
#> Macau Ferry Terminal 0 0
#> Total 10357 5319
#> Departure Mainland Visitors
#> Airport 1154
#> Express Rail Link West Kowloon 0
#> Hung Hom 0
#> Lo Wu 0
#> Lok Ma Chau Spur Line 0
#> Heung Yuen Wai 0
#> Hong Kong-Zhuhai-Macao Bridge 194
#> Lok Ma Chau 0
#> Man Kam To 0
#> Sha Tau Kok 0
#> Shenzhen Bay 524
#> China Ferry Terminal 0
#> Harbour Control 0
#> Kai Tak Cruise Terminal 0
#> Macau Ferry Terminal 0
#> Total 1872
#> Departure Other Visitors Departure Total
#> Airport 315 5244
#> Express Rail Link West Kowloon 0 0
#> Hung Hom 0 0
#> Lo Wu 0 0
#> Lok Ma Chau Spur Line 0 0
#> Heung Yuen Wai 0 0
#> Hong Kong-Zhuhai-Macao Bridge 15 452
#> Lok Ma Chau 0 0
#> Man Kam To 0 0
#> Sha Tau Kok 0 0
#> Shenzhen Bay 28 1853
#> China Ferry Terminal 0 0
#> Harbour Control 0 0
#> Kai Tak Cruise Terminal 0 0
#> Macau Ferry Terminal 0 0
#> Total 358 7549
Created on 2022-09-18 with reprex v2.0.2
To plot the pie chart in ggplot
plot a bar chart then change to polar coordinates.
Airport <- final[1,,]
Airport %>%
t() %>%
as.data.frame() %>%
mutate(`Arrival/Departure` = row.names(.)) %>%
ggplot(aes("", Airport, fill = `Arrival/Departure`)) +
geom_col(width = 1) +
scale_fill_manual(values = RColorBrewer::brewer.pal(n = 8, name = "Spectral")) +
coord_polar(theta = "y", start = 0) +
theme_void()
Created on 2022-09-18 with reprex v2.0.2
Upvotes: 3
Reputation: 84465
An alternative would be to select the tbody rows, filtering out the hidden items by attribute, then add in the headers later.
library(rvest)
library(tidyverse)
rows <- read_html("https://www.immd.gov.hk/eng/stat_20220901.html") %>% html_elements(".table-passengerTrafficStat tbody tr")
prefixes <- c("arr", "dep")
cols <- c("Hong Kong Residents", "Mainland Visitors", "Other Visitors", "Total")
headers <- c("Control_Point", crossing(prefixes, cols) %>% unite("headers", 1:2, remove = T) %>% unlist() %>% unname())
df <- map_dfr(
rows,
function(x) {
x %>%
html_elements("td[headers]") %>%
set_names(headers) %>%
html_text()
}
) %>%
mutate(across(c(-1), ~ str_replace(.x, ",", "") %>% as.integer()))
Or somewhat condensed,
library(rvest)
library(tidyverse)
rows <- read_html("https://www.immd.gov.hk/eng/stat_20220901.html") %>% html_elements(".table-passengerTrafficStat tbody tr")
prefixes <- c("arr", "dep")
cols <- c("Hong Kong Residents", "Mainland Visitors", "Other Visitors", "Total")
headers <- c("Control_Point", crossing(prefixes, cols) %>% unite("headers", 1:2, remove = T) %>% unlist() %>% unname())
df <- map_dfr(rows, ~ set_names(.x %>% html_elements("td[headers]") %>% html_text(), headers)) %>%
mutate(across(c(-1), ~ str_replace(.x, ",", "") %>% as.integer()))
Upvotes: 2