(building on my own question and its answer by @astrofunkswag here)
I am scraping web pages with rvest and turning the collected data into a data frame using purrr::map_df. I run into the problem that map_df selects only the first element of HTML tags with multiple elements. Ideally, I would like all elements of a tag to be captured in the resulting data frame, and the tags with fewer elements to be recycled.
Take the following code:
library(rvest)
library(tidyverse)
urls <- list("https://en.wikipedia.org/wiki/FC_Barcelona",
             "https://en.wikipedia.org/wiki/Rome")
h <- urls %>% map(read_html)
out <- h %>% map_df(~{
  a <- html_nodes(., "#firstHeading") %>% html_text()
  b <- html_nodes(., ".toctext") %>% html_text()
  a <- ifelse(length(a) == 0, NA, a)
  b <- ifelse(length(b) == 0, NA, b)
  df <- tibble(a, b)
})
out
which produces the following output:
> out
# A tibble: 2 x 2
  a            b
  <chr>        <chr>
1 FC Barcelona History
2 Rome         Etymology
This output is not what I want, because it includes only the first element of the tags corresponding to b. In the source web pages, the elements associated with b are the subtitles of the page. The desired output looks more or less like this:
  a            b
  <chr>        <chr>
1 FC Barcelona History
2 FC Barcelona 1899–1922: Beginnings
3 FC Barcelona 1923–1957: Rivera, Republic and Civil War
.
.
6 Rome         Etymology
7 Rome         History
8 Rome         Earliest history
.
.
From ?ifelse:
ifelse returns a value with the same shape as test
For example, see
ifelse(FALSE, 20, 1:5)
#[1] 1
Since length(FALSE) is 1, only the first value of 1:5 is selected, which is 1.
Similarly, when you call
ifelse(length(a) == 0, NA, a)
length(length(a) == 0) is 1, and hence only the first value of a is returned.
In this case we can use if instead of ifelse, since there is only one condition to check, and if/else returns the selected branch whole:
if(FALSE) 20 else 1:5 #returns
#[1] 1 2 3 4 5
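To see the same difference on a character vector like the scraped subtitles, here is a minimal sketch in base R. The vector a is made-up sample data, not actual scraped output.

```r
# Made-up sample data standing in for html_text() output (an assumption).
a <- c("History", "Etymology", "Support")

# ifelse() is vectorised over `test`; a length-1 test gives a length-1 result,
# so only the first element of `a` survives.
truncated <- ifelse(length(a) == 0, NA, a)

# if/else evaluates a single condition and returns the chosen branch unchanged.
kept <- if (length(a) == 0) NA else a

length(truncated)  # 1
length(kept)       # 3
```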
So the following gives the desired output:
library(tidyverse)
library(rvest)
h %>% map_df(~{
  a <- html_nodes(., "#firstHeading") %>% html_text()
  b <- html_nodes(., ".toctext") %>% html_text()
  a <- if (length(a) == 0) NA else a
  b <- if (length(b) == 0) NA else b
  tibble(a, b)
})
#    a            b
#    <chr>        <chr>
#  1 FC Barcelona History
#  2 FC Barcelona 1899–1922: Beginnings
#  3 FC Barcelona 1923–1957: Rivera, Republic and Civil War
#  4 FC Barcelona 1957–1978: Club de Fútbol Barcelona
#  5 FC Barcelona 1978–2000: Núñez and stabilization
#  6 FC Barcelona The Dream Team era
#  7 FC Barcelona 2000–2008: Exit Núñez, enter Laporta
#  8 FC Barcelona 2008–2012: Guardiola era
#  9 FC Barcelona 2014–present: Bartomeu era
# 10 FC Barcelona Support
# … with 78 more rows
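The recycling behaviour can also be checked without a network connection. The sketch below is only an illustration under stated assumptions: pages is made-up data standing in for the parsed HTML, and or_na is a hypothetical helper that wraps the if/else check. It relies on tibble() recycling length-1 columns to the length of the longest column.

```r
library(purrr)
library(tibble)

# Hypothetical helper (not part of the answer above): NA for an empty vector,
# otherwise the whole vector unchanged.
or_na <- function(x) if (length(x) == 0) NA_character_ else x

# Made-up stand-in for scraped pages: one title and zero or more subtitles each.
pages <- list(
  list(title = "Page A", subtitles = c("Intro", "History", "Support")),
  list(title = "Page B", subtitles = character(0))
)

# tibble() recycles the length-1 title against all subtitles of that page;
# map_df() then row-binds the per-page tibbles (this needs dplyr installed).
out <- map_df(pages, ~ tibble(a = or_na(.x$title), b = or_na(.x$subtitles)))
out
```

With this sample data, the first page contributes three rows and the page without subtitles contributes a single row with NA in b.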