Jose David

Reputation: 139

Arguments imply differing number of rows when scraping a website

I am getting this error when running my code:

Error in data.frame(date = html_text(html_nodes(pagina, ".node-post-date")),  : 
  arguments imply differing number of rows: 9, 10

When scraping the date tag on page 983, I only get 9 results (instead of the usual 10 per page). I think this happens because on that page one of the dates I want to scrape has a different tag from the one I am using.

I am quite new to R, so I do not know how to write an if statement in my code that returns an NA for the result I am missing.

Here is my code:

# Libraries
library(rvest)
library(purrr)
library(stringr)  # provides str_trim()
library(tidytext)
library(dplyr)

url_espectador <- 'https://www.elespectador.com/search/farc?page=%d&sort=created&order=desc'

map_df(980:990, function(i) {

  # Fill the page number into the %d placeholder
  pagina <- read_html(sprintf(url_espectador, i))
  print(i)

  data.frame(title = html_text(html_nodes(pagina, ".node-title a")),
             date = html_text(html_nodes(pagina, ".node-post-date")),
             link = paste0("https://www.elespectador.com", str_trim(html_attr(html_nodes(pagina, ".node-title a"), "href"))),
             stringsAsFactors=FALSE)
  }) -> noticias_espectador

Besides the if statement, is there any other solution to this? I am going to scrape a large number of pages so I need to avoid this row matching problem. Thanks for your help!

Upvotes: 1

Views: 564

Answers (1)

QHarr

Reputation: 84465

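You could use CSS OR syntax (a comma-separated selector list) to add the other class. This is suitable when there is only a small number of additional classes, and it is what the code at the end of this answer does.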

Alternatively, you could select a shared parent node, test whether a particular child is present, and return NA if it is not. This answer shows you the latter approach. If you go that route, a suitable parent node can be matched with the selector .node--search-result. You may still miss the actual child of interest (as in this case, where it has a different class), but the code won't error out.
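As a rough sketch of the parent-node idea: the HTML below is made up for illustration (only the class names come from the question and this answer), and `resultados` and `noticias` are hypothetical names. It relies on `html_node()` returning exactly one match per parent, with a missing node where nothing matches, which `html_text()` turns into NA.

```r
library(rvest)

# Hypothetical markup standing in for one results page: the second result
# uses a different class for its date, as on page 983.
pagina <- read_html('
  <div class="node--search-result">
    <div class="node-title"><a href="/a">First</a></div>
    <div class="node-post-date">1 Jan 2017</div>
  </div>
  <div class="node--search-result">
    <div class="node-title"><a href="/b">Second</a></div>
    <div class="field--name-post-date">2 Jan 2017</div>
  </div>')

# One node per search result
resultados <- html_nodes(pagina, ".node--search-result")

# html_node() (singular) returns one entry per parent, so all columns
# have the same length even when a child is absent
noticias <- data.frame(
  title = html_text(html_node(resultados, ".node-title a")),
  date  = html_text(html_node(resultados, ".node-post-date")),
  link  = paste0("https://www.elespectador.com",
                 html_attr(html_node(resultados, ".node-title a"), "href")),
  stringsAsFactors = FALSE
)
```

Here the second row gets an NA date rather than breaking `data.frame()`, which is the trade-off described above: no error, but the date under the other class is not picked up.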

There is a third option: in the cases observed, the classes share a common suffix, so you could use an attribute = value CSS selector with either the contains (*=) or ends-with ($=) operator, e.g. date = html_text(html_nodes(pagina, "[class$='post-date']")).
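To see how the two operators differ, here is a small sketch on invented markup (the extra class on the second div is an assumption added purely to show the difference, and `termina` and `contiene` are hypothetical names). The operators compare against the whole class attribute string, so ends-with only matches when post-date is at the very end of it.

```r
library(rvest)

# Hypothetical markup: the second element carries an additional class
# after the date class.
pagina <- read_html('
  <div class="node-post-date">1 Jan 2017</div>
  <div class="field--name-post-date extra">2 Jan 2017</div>')

# Ends-with: the class attribute must end in "post-date"
termina <- html_text(html_nodes(pagina, "[class$='post-date']"))

# Contains: "post-date" may appear anywhere in the class attribute
contiene <- html_text(html_nodes(pagina, "[class*='post-date']"))

length(termina)   # matches only the first div
length(contiene)  # matches both divs
```

If the date elements can carry more than one class, the contains form is probably the safer of the two.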

library(rvest)
library(purrr)
library(stringr)  # provides str_trim()
library(tidytext)
library(dplyr)

url_espectador <- 'https://www.elespectador.com/search/farc?page=%d&sort=created&order=desc'

map_df(980:990, function(i) {

  # Fill the page number into the %d placeholder
  pagina <- read_html(sprintf(url_espectador, i))
  print(i)

  data.frame(title = html_text(html_nodes(pagina, ".node-title a")),
             # Comma-separated selector: match either class used for the date
             date = html_text(html_nodes(pagina, ".node-post-date, .field--name-post-date")),
             link = paste0("https://www.elespectador.com", str_trim(html_attr(html_nodes(pagina, ".node-title a"), "href"))),
             stringsAsFactors=FALSE)
}) -> noticias_espectador

Upvotes: 1
