Reputation: 8127
I'm trying my hand at scraping Wikipedia and have reached an impasse. I'm using the squads of the 2014 FIFA World Cup as an example. I want to extract the list of participating countries from the table of contents of the page "2014 FIFA World Cup squads" and store them as a vector. Here's how far I got:
library(tidyverse)
library(rvest)
library(XML)
library(RCurl)
(Countries <- read_html("https://en.wikipedia.org/wiki/2014_FIFA_World_Cup_squads") %>%
html_node(xpath = '//*[@id="toc"]/ul') %>%
htmlTreeParse() %>%
xmlRoot())
This spits out a bunch of HTML code that I won't copy/paste here. Specifically, I want to extract the text of every <span class="toctext"> element, such as "Group A", "Brazil", "Cameroon", etc., and save the results as a vector. What function would make this happen?
Upvotes: 4
Views: 2235
Reputation: 26258
You can read the text from a node using html_text():
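library(rvest)  # provides read_html(), html_node(), html_text() and %>%
url <- "https://en.wikipedia.org/wiki/2014_FIFA_World_Cup_squads"
# Select the table-of-contents node by its id and extract all of its text
toc <- url %>%
  read_html() %>%
  html_node(xpath = '//*[@id="toc"]') %>%
  html_text()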
This gives you a single string (a character vector of length one). You can then split it on the newline character \n to get one element per line, and drop the empty strings:
contents <- strsplit(toc, "\n")[[1]]
contents[contents != ""]
# [1] "Contents" "1 Group A" "1.1 Brazil"
# [4] "1.2 Cameroon" "1.3 Croatia" "1.4 Mexico"
# [7] "2 Group B" "2.1 Australia" "2.2 Chile"
# [10] "2.3 Netherlands" "2.4 Spain" "3 Group C"
# [13] "3.1 Colombia" "3.2 Greece" "3.3 Ivory Coast"
# [16] "3.4 Japan" "4 Group D" "4.1 Costa Rica"
# [19] "4.2 England" "4.3 Italy" "4.4 Uruguay"
# ---
# etc
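Since the question asks specifically for the <span class="toctext"> elements, another option is to select those nodes directly, which also avoids the section numbers. The following is only a sketch: the toc_html string is a hypothetical, trimmed-down stand-in for the page's table of contents (so the example runs offline), and the "Group " filter assumes the group headings all start with that word, as in the output above.

```r
library(rvest)

# Hypothetical, simplified stand-in for the real table of contents markup.
toc_html <- '<div id="toc">
  <ul>
    <li><a href="#Group_A"><span class="tocnumber">1</span> <span class="toctext">Group A</span></a>
      <ul>
        <li><a href="#Brazil"><span class="tocnumber">1.1</span> <span class="toctext">Brazil</span></a></li>
        <li><a href="#Cameroon"><span class="tocnumber">1.2</span> <span class="toctext">Cameroon</span></a></li>
      </ul>
    </li>
  </ul>
</div>'

# read_html() accepts a literal HTML string as well as a URL.
page <- read_html(toc_html)

# Select every span with class "toctext" inside the node with id "toc"
# and extract its text, giving one vector element per entry.
entries <- page %>%
  html_nodes("#toc span.toctext") %>%
  html_text()

# Assumption: group headings start with "Group "; drop them to keep countries.
countries <- entries[!grepl("^Group ", entries)]
countries
```

Against the live page, the same selector would be applied to read_html(url) instead of the inline string, assuming the page still serves the toc markup described in the question.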
Generally, to read tables in an HTML document you can use the html_table() function, but in this case the table of contents isn't picked up, presumably because it is marked up as a nested list rather than as a <table> element:
url %>%
read_html() %>%
html_table()
Upvotes: 3