Reputation: 139
I am getting this error when running my code:
Error in data.frame(date = html_text(html_nodes(pagina, ".node-post-date")), :
arguments imply differing number of rows: 9, 10
When scraping that tag on page 983, I only get 9 results (instead of the usual 10 results per page). I think this happens because, on that page, one of the dates I want to scrape has a different tag from the one I am using.
I am quite new to R, so I do not know how to add an if statement to my code that returns NA for the result I am missing.
Here is my code:
# Libraries
library(rvest)
library(purrr)
library(stringr) # needed for str_trim()
library(tidytext)
library(dplyr)
url_espectador <- 'https://www.elespectador.com/search/farc?page=%d&sort=created&order=desc'
map_df(980:990, function(i) {
  # The URL template has a single %d placeholder, filled with the page number
  pagina <- read_html(sprintf(url_espectador, i))
  print(i)
  data.frame(title = html_text(html_nodes(pagina, ".node-title a")),
             date = html_text(html_nodes(pagina, ".node-post-date")),
             link = paste0("https://www.elespectador.com", str_trim(html_attr(html_nodes(pagina, ".node-title a"), "href"))),
             stringsAsFactors = FALSE)
}) -> noticias_espectador
Besides the if statement, is there any other solution to this? I am going to scrape a large number of pages so I need to avoid this row matching problem. Thanks for your help!
Upvotes: 1
Views: 564
Reputation: 84465
You could use the CSS "or" syntax (a comma-separated list of selectors) to add the other class. This is suitable when there are only a small number of additional classes.
Alternatively, you could select a shared parent node, test whether a particular child is present, and return NA if it is not. This answer shows you the latter approach. If you use the latter, a suitable parent node can be selected with .node--search-result. With this approach you may miss the actual child of interest (as in this case, where it has a different class), but the code won't error out.
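As a minimal sketch of the parent-node approach: the HTML below is hypothetical markup written for illustration (it is not copied from the site), and the names resultados and noticias are made up. It relies on html_node(), which returns exactly one result per parent and a missing node where nothing matches, so html_text() yields NA for that row.

```r
library(rvest)

# Hypothetical markup standing in for one results page:
# the second result has no date element at all.
pagina <- minimal_html('
  <div class="node--search-result">
    <h2 class="node-title"><a href="/a">First</a></h2>
    <div class="node-post-date">1 Jan 2019</div>
  </div>
  <div class="node--search-result">
    <h2 class="node-title"><a href="/b">Second</a></h2>
  </div>')

# One parent node per search result
resultados <- html_nodes(pagina, ".node--search-result")

# html_node() (singular) keeps one entry per parent, so every column
# has the same length; a missing child becomes NA
noticias <- data.frame(
  title = html_text(html_node(resultados, ".node-title a")),
  date  = html_text(html_node(resultados, ".node-post-date")),
  link  = html_attr(html_node(resultados, ".node-title a"), "href"),
  stringsAsFactors = FALSE
)
```

Because each column is built from the same set of parents, data.frame() no longer sees columns of differing length, at the cost of an NA wherever the child uses a class you did not select.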
There is a third option: in the cases observed, the classes have a common suffix, so you could use an attribute = value CSS selector with either the contains (*) or ends with ($) operator, e.g. date = html_text(html_nodes(pagina, "[class$='post-date']")).
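To illustrate how the two attribute operators differ, here is a small sketch on hypothetical markup (the third div and its extra "highlighted" class are invented for the example). Note that $= compares against the whole class attribute string, so it stops matching if another class follows the date class, whereas *= matches the substring anywhere.

```r
library(rvest)

# Hypothetical markup: two date classes sharing the suffix "post-date",
# one element with a trailing extra class, and one unrelated element.
pagina <- minimal_html('
  <div class="node-post-date">1 Jan 2019</div>
  <div class="field--name-post-date">2 Jan 2019</div>
  <div class="node-post-date highlighted">3 Jan 2019</div>
  <div class="node-title">Not a date</div>')

# Ends with: the class attribute as a whole must end in "post-date"
fechas_fin <- html_text(html_nodes(pagina, "[class$='post-date']"))

# Contains: "post-date" may appear anywhere in the class attribute
fechas_contiene <- html_text(html_nodes(pagina, "[class*='post-date']"))
```

Here fechas_fin picks up the first two dates only, while fechas_contiene picks up all three, so the contains operator is the more forgiving choice if elements may carry additional classes.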
library(rvest)
library(purrr)
library(stringr) # needed for str_trim()
library(tidytext)
library(dplyr)
url_espectador <- 'https://www.elespectador.com/search/farc?page=%d&sort=created&order=desc'
map_df(980:990, function(i) {
  # The URL template has a single %d placeholder, filled with the page number
  pagina <- read_html(sprintf(url_espectador, i))
  print(i)
  data.frame(title = html_text(html_nodes(pagina, ".node-title a")),
             # The comma acts as "or": match either of the two date classes
             date = html_text(html_nodes(pagina, ".node-post-date, .field--name-post-date")),
             link = paste0("https://www.elespectador.com", str_trim(html_attr(html_nodes(pagina, ".node-title a"), "href"))),
             stringsAsFactors = FALSE)
}) -> noticias_espectador
Upvotes: 1