I'm trying to scrape data from every team that has participated in the World Cup at least once in the past 30 years.
My knowledge of how to use the R package rvest to scrape tables and whatnot from the web is rudimentary at best.
Currently, my code looks like this:
library(rvest)
library(dplyr)
fifadata <- read_html("http://www.fifa.com/fifa-tournaments/teams/association=BRA/index.html")
fifa_data_html <-
  html_nodes(fifadata,
             xpath = '/html/body/div[1]/div[5]/div/div[4]/div/div[2]/div/div/div[1]/div/table') %>%
  html_table(header = FALSE, fill = TRUE)
fifa_data_html
The first table on the webpage is what I want to scrape, but when I run the above code, html_nodes() returns {xml_nodeset (0)}.
Any input into how to go about scraping the table in question properly would be much appreciated.
Reputation: 34703
Here's something. It's quite a mess:
xp = paste0('//li[@class="tbl-cupname"]/',
            'div[@class="label-data"]/',
            'span[@class="text"][text()="FIFA World Cup™"]/../../',
            'following-sibling::li[@class="tbl-appearances"]/',
            'div[@class="label-data"]/',
            'span[@class="text"]')
fifadata %>% html_nodes(xpath = xp) %>% html_text() %>% as.integer()
# [1] 20
Let's break down the logic.
The naive query:
fifadata %>% html_nodes(
xpath = '//li[@class="tbl-appearances"]/div[@class="label-data"]/span'
)
is sufficient to get us the four rows giving the number of appearances in each of the four tournaments listed on this page. If the web designers are merciful, that is all you need: just select the first of these from each page you'd like to scrape, and you'll have what you're after.
This is not robust, however -- it will give incorrect results whenever the row order changes, or if the row you want is absent.
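To make the failure mode concrete, here is a minimal sketch on a made-up HTML fragment. The markup is an assumption modelled on the class names used in the query, not a copy of the FIFA page, and the mock team has no World Cup row at all. Taking the first match then silently returns another tournament's count:

```r
library(rvest)

# Hypothetical page for a team with no World Cup row (structure assumed)
mock_html <- '
<div>
  <ul>
    <li class="tbl-cupname"><div class="label-data"><span class="text">FIFA Confederations Cup</span></div></li>
    <li class="tbl-appearances"><div class="label-data"><span class="text">7</span></div></li>
  </ul>
  <ul>
    <li class="tbl-cupname"><div class="label-data"><span class="text">Olympic Football Tournament</span></div></li>
    <li class="tbl-appearances"><div class="label-data"><span class="text">3</span></div></li>
  </ul>
</div>'

# The naive query returns every appearances cell, in page order
naive <- read_html(mock_html) %>%
  html_nodes(xpath = '//li[@class="tbl-appearances"]/div[@class="label-data"]/span') %>%
  html_text() %>%
  as.integer()

naive[1]  # 7: the Confederations Cup count, not a World Cup count
```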
The query presented takes care of this.
First, we identify the row that names the FIFA World Cup. The essential structure there is:
<li class="tbl-cupname">
  <div class="label-data">
    <span class="text"> tournament_name </span>
  </div>
</li>
We use the class attributes since there are other li and div elements nearby that we want to be sure to exclude. So, we can select the four rows corresponding to the tournaments (FIFA World Cup, FIFA Confederations Cup, FIFA Women's World Cup, and Women's Olympic Football Tournament) with:
fifadata %>% html_nodes(xpath = '//li[@class="tbl-cupname"]')
Eliminating the three tournaments that are irrelevant to your pursuit requires a condition on the <span> element, hence the rest of the first part:
xp_part_1 = paste0('//li[@class="tbl-cupname"]/',
                   'div[@class="label-data"]/',
                   'span[@class="text"][text()="FIFA World Cup™"]')
fifadata %>% html_nodes(xpath = xp_part_1)
This selects the tournament. However, we want the subsequent li, which contains the number of appearances. The core structure we're touching here is:
<li class="tbl-cupname"> </li>
<li class="tbl-appearances"> </li>
Part 1 of the xpath has navigated us down two levels below this li, however, so we need to "ascend" the nodes with .. (this is exactly like cd .. in the Linux terminal to go up a level, so hopefully that's familiar).
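As a small illustration of the ascent, the sketch below runs on an assumed one-row fragment (the markup and the name "Some Cup" are made up). Each /.. appended to the path steps up one element, so two of them land back on the li:

```r
library(rvest)

# Made-up fragment reusing the class names from the query (structure assumed)
frag <- read_html(paste0(
  '<ul><li class="tbl-cupname"><div class="label-data">',
  '<span class="text">Some Cup</span></div></li></ul>'
))

span_xp <- '//li[@class="tbl-cupname"]/div[@class="label-data"]/span[@class="text"]'

# html_name() reports the tag name of each selected node
up0 <- frag %>% html_nodes(xpath = span_xp) %>% html_name()
up1 <- frag %>% html_nodes(xpath = paste0(span_xp, '/..')) %>% html_name()
up2 <- frag %>% html_nodes(xpath = paste0(span_xp, '/../..')) %>% html_name()

c(up0, up1, up2)  # "span" "div" "li"
```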
We then use the following-sibling syntax to select nodes that are at the same level as the current node, but come after it.
Once we're back on the same level as the li naming the tournament, we can continue with the "naive" query to drill down to the number of appearances.
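Since the goal is to repeat this for many teams, one option is to wrap the query in a function. The following is only a sketch: world_cup_appearances() is a hypothetical helper name, and the two mock pages are assumptions that mimic the structure described above (each tournament row in its own list), not real FIFA markup.

```r
library(rvest)

# Hypothetical helper: takes a parsed page and returns the World Cup
# appearance count, or NA when the page has no World Cup row
world_cup_appearances <- function(page) {
  xp <- paste0('//li[@class="tbl-cupname"]/',
               'div[@class="label-data"]/',
               'span[@class="text"][text()="FIFA World Cup\u2122"]/../../',
               'following-sibling::li[@class="tbl-appearances"]/',
               'div[@class="label-data"]/',
               'span[@class="text"]')
  n <- page %>% html_nodes(xpath = xp) %>% html_text() %>% as.integer()
  if (length(n) == 0) NA_integer_ else n[1]
}

# Builds one mock tournament row (assumed structure)
mock_row <- function(cup, n) {
  paste0('<ul><li class="tbl-cupname"><div class="label-data">',
         '<span class="text">', cup, '</span></div></li>',
         '<li class="tbl-appearances"><div class="label-data">',
         '<span class="text">', n, '</span></div></li></ul>')
}

# The World Cup row deliberately comes second here
with_wc <- read_html(
  paste0('<div>', mock_row('FIFA Confederations Cup', 7),
         mock_row('FIFA World Cup\u2122', 20), '</div>'),
  encoding = "UTF-8"
)
without_wc <- read_html(
  paste0('<div>', mock_row('FIFA Confederations Cup', 7), '</div>'),
  encoding = "UTF-8"
)

world_cup_appearances(with_wc)     # 20
world_cup_appearances(without_wc)  # NA
```

On live pages, the page argument would come from read_html() on a URL built like the one in the question, for example by swapping BRA for another association code. Which codes exist, and whether every team page shares this layout, is not something this sketch checks.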