ronzenith

Reputation: 383

How to scrape an HTML table with nested columns using rvest?

I have run into a problem scraping an HTML table with nested columns.

The table is from the Immigration Department of Hong Kong.

A screenshot is shown here:

[image: screenshot of the table]

I tried to do it with rvest, but the result is messy.

library(rvest)
library(tidyverse)  # also attaches dplyr and stringr

url_data <- "https://www.immd.gov.hk/eng/stat_20220901.html"

# Select the table by its position in the page, then parse it
css_selector <- "body > section:nth-child(7) > div > div > div > div > table"
immiTable <- url_data %>%
  read_html() %>%
  html_element(css = css_selector) %>%
  html_table()
immiTable

[image: screenshot of the resulting output]

My goal is to extract the first row (i.e. Airport) and plot it as a pie chart, and also to build a data frame of the whole table and save it to Excel.

I have found that teaching material on unnesting tables and scraping nested tables is rather scarce, so I would appreciate your guidance. Thank you very much for your help.

Upvotes: 3

Views: 319

Answers (2)

Rui Barradas

Reputation: 76402

Here is a way. The header format complicates things, but the code below works. It extracts the entire table, not just the first row.

suppressPackageStartupMessages({
  library(rvest)
  library(dplyr)
  library(ggplot2)
})

url_data <- "https://www.immd.gov.hk/eng/stat_20220901.html"

page <- url_data %>% read_html()

page %>%
  html_elements("[headers='Arrival']") %>%
  html_text() %>%
  paste("Arrival", .) -> col_names
page %>%
  html_elements("[headers='Departure']") %>%
  html_text() %>%
  paste("Departure", .) %>%
  c(col_names, .) -> col_names
page %>%
  html_elements("[headers='Control_Point']") %>%
  html_text() -> row_names
page %>%
  html_elements("[class='hRight']") %>%
  html_text() %>%
  gsub(",", "", .) %>%
  as.numeric() %>%
  matrix(nrow = length(row_names), byrow = TRUE) %>%
  as.data.frame() %>%
  setNames(col_names) %>%
  `row.names<-`(row_names) -> final

final
#>                                Arrival Hong Kong Residents
#> Airport                                               4258
#> Express Rail Link West Kowloon                           0
#> Hung Hom                                                 0
#> Lo Wu                                                    0
#> Lok Ma Chau Spur Line                                    0
#> Heung Yuen Wai                                           0
#> Hong Kong-Zhuhai-Macao Bridge                          333
#> Lok Ma Chau                                              0
#> Man Kam To                                               0
#> Sha Tau Kok                                              0
#> Shenzhen Bay                                          3404
#> China Ferry Terminal                                     0
#> Harbour Control                                          0
#> Kai Tak Cruise Terminal                                  0
#> Macau Ferry Terminal                                     0
#> Total                                                 7995
#>                                Arrival Mainland Visitors Arrival Other Visitors
#> Airport                                             1488                    422
#> Express Rail Link West Kowloon                         0                      0
#> Hung Hom                                               0                      0
#> Lo Wu                                                  0                      0
#> Lok Ma Chau Spur Line                                  0                      0
#> Heung Yuen Wai                                         0                      0
#> Hong Kong-Zhuhai-Macao Bridge                         28                     39
#> Lok Ma Chau                                            0                      0
#> Man Kam To                                             0                      0
#> Sha Tau Kok                                            0                      0
#> Shenzhen Bay                                         348                     37
#> China Ferry Terminal                                   0                      0
#> Harbour Control                                        0                      0
#> Kai Tak Cruise Terminal                                0                      0
#> Macau Ferry Terminal                                   0                      0
#> Total                                               1864                    498
#>                                Arrival Total Departure Hong Kong Residents
#> Airport                                 6168                          3775
#> Express Rail Link West Kowloon             0                             0
#> Hung Hom                                   0                             0
#> Lo Wu                                      0                             0
#> Lok Ma Chau Spur Line                      0                             0
#> Heung Yuen Wai                             0                             0
#> Hong Kong-Zhuhai-Macao Bridge            400                           243
#> Lok Ma Chau                                0                             0
#> Man Kam To                                 0                             0
#> Sha Tau Kok                                0                             0
#> Shenzhen Bay                            3789                          1301
#> China Ferry Terminal                       0                             0
#> Harbour Control                            0                             0
#> Kai Tak Cruise Terminal                    0                             0
#> Macau Ferry Terminal                       0                             0
#> Total                                  10357                          5319
#>                                Departure Mainland Visitors
#> Airport                                               1154
#> Express Rail Link West Kowloon                           0
#> Hung Hom                                                 0
#> Lo Wu                                                    0
#> Lok Ma Chau Spur Line                                    0
#> Heung Yuen Wai                                           0
#> Hong Kong-Zhuhai-Macao Bridge                          194
#> Lok Ma Chau                                              0
#> Man Kam To                                               0
#> Sha Tau Kok                                              0
#> Shenzhen Bay                                           524
#> China Ferry Terminal                                     0
#> Harbour Control                                          0
#> Kai Tak Cruise Terminal                                  0
#> Macau Ferry Terminal                                     0
#> Total                                                 1872
#>                                Departure Other Visitors Departure Total
#> Airport                                             315            5244
#> Express Rail Link West Kowloon                        0               0
#> Hung Hom                                              0               0
#> Lo Wu                                                 0               0
#> Lok Ma Chau Spur Line                                 0               0
#> Heung Yuen Wai                                        0               0
#> Hong Kong-Zhuhai-Macao Bridge                        15             452
#> Lok Ma Chau                                           0               0
#> Man Kam To                                            0               0
#> Sha Tau Kok                                           0               0
#> Shenzhen Bay                                         28            1853
#> China Ferry Terminal                                  0               0
#> Harbour Control                                       0               0
#> Kai Tak Cruise Terminal                               0               0
#> Macau Ferry Terminal                                  0               0
#> Total                                               358            7549

Created on 2022-09-18 with reprex v2.0.2
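
The reshaping step can be checked offline, without touching the website. The sketch below is only an illustration: `cell_text`, `toy` and the two-row subset are made-up names and a hand-picked slice of the values printed above, standing in for what `html_text()` would return in page order.

```r
# Cell text in page order (row by row); values copied from the output above.
cell_text <- c("4,258", "1,488", "422",   # Airport
               "333", "28", "39")         # Hong Kong-Zhuhai-Macao Bridge
row_names <- c("Airport", "Hong Kong-Zhuhai-Macao Bridge")
col_names <- c("Arrival Hong Kong Residents",
               "Arrival Mainland Visitors",
               "Arrival Other Visitors")

# Strip thousands separators, then convert to numbers.
values <- as.numeric(gsub(",", "", cell_text, fixed = TRUE))

# byrow = TRUE because the cells were read row by row, not column by column.
toy <- as.data.frame(matrix(values, nrow = length(row_names), byrow = TRUE))
names(toy) <- col_names
rownames(toy) <- row_names
toy

# sub() removes only the first comma, gsub() removes all of them.
sub(",", "", "1,234,567", fixed = TRUE)   # "1234,567"
gsub(",", "", "1,234,567", fixed = TRUE)  # "1234567"
```

The difference between `sub()` and `gsub()` only matters for counts of a million or more, but `gsub()` is the safer choice for this kind of cleanup.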


To draw the pie chart with ggplot2, plot a bar chart and then switch to polar coordinates.

Airport <- final[1, ]
Airport %>%
  t() %>%
  as.data.frame() %>%
  mutate(`Arrival/Departure` = row.names(.)) %>%
  ggplot(aes("", Airport, fill = `Arrival/Departure`)) +
  geom_col(width = 1) +
  scale_fill_manual(values = RColorBrewer::brewer.pal(n = 8, name = "Spectral")) +
  coord_polar(theta = "y", start = 0) +
  theme_void()

Created on 2022-09-18 with reprex v2.0.2
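
The question also asks for the whole table to be saved to Excel, which the code above does not cover. A minimal sketch follows, assuming a CSV file that Excel can open is acceptable. `final_demo`, `export` and the file name are hypothetical; `final_demo` is a two-row stand-in built from values in the output above, so in practice you would use `final` itself.

```r
# Stand-in for the scraped data frame (values copied from the output above).
final_demo <- data.frame(
  `Arrival Total`   = c(6168, 400),
  `Departure Total` = c(5244, 452),
  row.names = c("Airport", "Hong Kong-Zhuhai-Macao Bridge"),
  check.names = FALSE  # keep the spaces in the column names
)

# Move the row names into a real column so they survive the export.
export <- cbind(Control_Point = rownames(final_demo), final_demo)
rownames(export) <- NULL

# Hypothetical output location; a temp dir keeps the example side-effect free.
out_file <- file.path(tempdir(), "immd_stat_20220901.csv")
write.csv(export, out_file, row.names = FALSE)

# If the writexl package is installed, a native workbook could be written instead:
# writexl::write_xlsx(export, file.path(tempdir(), "immd_stat_20220901.xlsx"))
```

Writing CSV needs only base R; the commented `writexl::write_xlsx()` call is one option if a real .xlsx file is required.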

Upvotes: 3

QHarr

Reputation: 84465

An alternative would be to select the tbody rows, filtering out the hidden items by attribute, then add in the headers later.

library(rvest)
library(tidyverse)

rows <- read_html("https://www.immd.gov.hk/eng/stat_20220901.html") %>% html_elements(".table-passengerTrafficStat tbody tr")
prefixes <- c("arr", "dep")
cols <- c("Hong Kong Residents", "Mainland Visitors", "Other Visitors", "Total")
headers <- c("Control_Point", crossing(prefixes, cols) %>% unite("headers", 1:2, remove = TRUE) %>% unlist() %>% unname())

df <- map_dfr(
  rows,
  function(x) {
    x %>%
      html_elements("td[headers]") %>%
      set_names(headers) %>%
      html_text()
  }
) %>%
  mutate(across(c(-1), ~ str_replace_all(.x, ",", "") %>% as.integer()))

Or somewhat condensed,

library(rvest)
library(tidyverse)

rows <- read_html("https://www.immd.gov.hk/eng/stat_20220901.html") %>% html_elements(".table-passengerTrafficStat tbody tr")
prefixes <- c("arr", "dep")
cols <- c("Hong Kong Residents", "Mainland Visitors", "Other Visitors", "Total")
headers <- c("Control_Point", crossing(prefixes, cols) %>% unite("headers", 1:2, remove = TRUE) %>% unlist() %>% unname())

df <- map_dfr(rows, ~ set_names(.x %>% html_elements("td[headers]") %>% html_text(), headers)) %>%
  mutate(across(c(-1), ~ str_replace_all(.x, ",", "") %>% as.integer()))
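
To see what the `td[headers]` attribute selector does, it can help to try it on a tiny hand-written table. The markup below is invented for illustration and is not the real page: it assumes, as described above, that the cells to keep carry a `headers` attribute while the extra cell does not.

```r
library(rvest)

# Toy markup, NOT the real page: three cells with a `headers` attribute
# and one extra cell without it.
toy <- minimal_html('
  <table class="demo">
    <tbody>
      <tr>
        <td headers="Control_Point">Airport</td>
        <td class="extra">Arrival</td>
        <td headers="Arrival">4,258</td>
        <td headers="Departure">3,775</td>
      </tr>
    </tbody>
  </table>')

# Keep only the cells that have a `headers` attribute.
cells <- toy %>%
  html_elements(".demo tbody tr") %>%
  html_elements("td[headers]")

# Name each value after the header it points to.
values <- html_text(cells)
names(values) <- html_attr(cells, "headers")
values
```

The cell without a `headers` attribute is dropped, so each row yields exactly one value per named column.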

Upvotes: 2
