Turning a table in HTML into a data frame

Question

I'm trying my hand at scraping tables from Wikipedia and I've hit an impasse. I'm using the squads of the 2014 FIFA World Cup as an example. I want to extract the list of participating countries from the table of contents of the page "2014 FIFA World Cup squads" and store them as a vector. Here's how far I got:

library(tidyverse)
library(rvest)
library(XML)
library(RCurl)

(Countries <- read_html("https://en.wikipedia.org/wiki/2014_FIFA_World_Cup_squads") %>% 
  html_node(xpath = '//*[@id="toc"]/ul') %>% 
  htmlTreeParse() %>%
  xmlRoot())

This spits out a bunch of HTML code that I won't copy/paste here. I'm specifically looking to extract the text of every entry, such as "Group A", "Brazil", "Cameroon", etc., and save it all as a vector. What function would make this happen?

SymbolixAU · Accepted Answer

You can read the text from a node using html_text():

url <- "https://en.wikipedia.org/wiki/2014_FIFA_World_Cup_squads"
toc <- url %>%
    read_html() %>%
    html_node(xpath = '//*[@id="toc"]') %>%
    html_text()

This gives you a single string (a character vector of length one). You can then split it on the newline character ("\n") to get the entries as a vector, and then drop the blanks:

contents <- strsplit(toc, "\n")[[1]]

contents[contents != ""]

# [1] "Contents"                                   "1 Group A"                                  "1.1 Brazil"                                
# [4] "1.2 Cameroon"                               "1.3 Croatia"                                "1.4 Mexico"                                
# [7] "2 Group B"                                  "2.1 Australia"                              "2.2 Chile"                                 
# [10] "2.3 Netherlands"                            "2.4 Spain"                                  "3 Group C"                                 
# [13] "3.1 Colombia"                               "3.2 Greece"                                 "3.3 Ivory Coast"                           
# [16] "3.4 Japan"                                  "4 Group D"                                  "4.1 Costa Rica"                            
# [19] "4.2 England"                                "4.3 Italy"                                  "4.4 Uruguay"                               
# ---
# etc
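
If you only want the country names, one option is to filter on the numbering. This is a minimal sketch, assuming the entries look like the output above (countries numbered "1.1", "1.2" and so on, followed by a space). The short contents vector here is a hand-typed sample of that output, not a fresh scrape.

```r
# Hand-typed sample of the vector produced above
contents <- c("Contents", "1 Group A", "1.1 Brazil", "1.2 Cameroon",
              "2 Group B", "2.1 Australia")

# Assumption: countries are the entries numbered "<group>.<n>"
is_country <- grepl("^[0-9]+\\.[0-9]+\\s", contents)

# Strip the leading number and whitespace, leaving just the name
countries <- sub("^[0-9]+\\.[0-9]+\\s+", "", contents[is_country])
countries
# [1] "Brazil"    "Cameroon"  "Australia"
```

Any other numbered sub-sections further down the real table of contents would match this pattern too, so you may need to keep only the entries that fall under a "Group" heading.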

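Generally, to read tables in an HTML document you can use the html_table() function, which picks up <table> elements. In this case the table of contents isn't read, most likely because it is marked up as a list (note the ul in your XPath) rather than as a table.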

url %>% 
    read_html() %>%
    html_table()
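
To see the difference without hitting Wikipedia, here is a small sketch on a made-up HTML snippet. The markup and the "Example Player" row are hypothetical stand-ins, not the real page. It shows that html_table() returns one entry per <table> element, while list items have to be selected and read with html_text().

```r
library(rvest)

# Hypothetical stand-in page: a list (like a table of contents) and one real table
page <- read_html('
  <div id="toc"><ul><li>Group A</li><li>Brazil</li></ul></div>
  <table>
    <tr><th>No.</th><th>Player</th></tr>
    <tr><td>1</td><td>Example Player</td></tr>
  </table>')

# html_table() on the whole document returns a list with one element per <table>
tables <- html_table(page)
length(tables)

# The list items are not tables, so select them and read their text instead
items <- page %>%
    html_nodes(xpath = '//*[@id="toc"]//li') %>%
    html_text()
items
```

On the real page, the same html_nodes() / html_text() combination gives you one element per entry directly, which avoids the strsplit() step, provided the XPath matches how the table of contents is actually marked up.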
