youraz

Reputation: 483

How to get rid of the error while scraping web in R?

I'm scraping this website and get the error message "Tibble columns must have compatible sizes."
What should I do in this case?

library(rvest)
library(tidyverse)

url <- "https://www.zomato.com/tr/toronto/drinks-and-nightlife?page=5"
map_dfr(
  .x = url,
  .f = function(x) {
    tibble(
      url = x,
      place = read_html(x) %>%
        html_nodes("a.result-title.hover_feedback.zred.bold.ln24.fontsize0") %>%
        html_attr("title"),
      price = read_html(x) %>%
        html_nodes("div.res-cost.clearfix span.col-s-11.col-m-12.pl0") %>%
        html_text()
    )
  }
) -> df_zomato

Thanks in advance.

Upvotes: 0

Views: 455

Answers (1)

Dave2e

Reputation: 24079

The problem is that not every restaurant has a complete record. In this example, the 13th item on the list did not include a price, so the price vector had 14 items while the place vector had 15.

One way to solve this is to find the common parent node and then parse within each parent using the html_node() function. Unlike html_nodes(), html_node() always returns exactly one result per node, even if that result is NA.
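To illustrate the difference, here is a self-contained sketch using rvest's minimal_html() helper; the class names below are made up for the example, not Zomato's actual markup:

```r
library(rvest)

# Two fake "result" articles; only the first has a price
page <- minimal_html('
  <article class="search-result">
    <a title="Place A">Place A</a>
    <span class="price">$10</span>
  </article>
  <article class="search-result">
    <a title="Place B">Place B</a>
  </article>
')

# html_nodes() silently drops missing matches: length 1
html_nodes(page, "span.price") %>% html_text()

# html_node() on each parent returns one value per parent:
# length 2, with NA where the price is absent
page %>%
  html_nodes("article.search-result") %>%
  html_node("span.price") %>%
  html_text()
```

Because the second approach yields one value per parent, the place and price vectors stay the same length and tibble() no longer complains.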

library(rvest)
library(dplyr)
library(tibble)


url <- "https://www.zomato.com/tr/toronto/drinks-and-nightlife?page=5"
readpage <- function(url){
   # read the page once
   page <- read_html(url)

   # parse out the parent nodes
   results <- page %>% html_nodes("article.search-result")

   # retrieve the place and price from each parent
   place <- results %>% html_node("a.result-title.hover_feedback.zred.bold.ln24.fontsize0") %>%
      html_attr("title")
   price <- results %>% html_node("div.res-cost.clearfix span.col-s-11.col-m-12.pl0") %>%
      html_text()

   # return a tibble/data frame
   tibble(url, place, price)
}

readpage(url)

Also note that in your code example above, you were reading the same page multiple times (once per column). This is slow and puts additional load on the server; it could even be viewed as a "denial of service" attack.
It is best to read the page into memory once and then work with that copy.

Update
To answer your question concerning multiple pages: wrap the function above in lapply() and then bind the resulting list of data frames (or tibbles).

dfs <- lapply(listofurls, function(url){ readpage(url)})
finalanswer <- bind_rows(dfs)
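Since your original code already uses purrr, note that map_dfr() does the lapply-then-bind in one step. A self-contained sketch with a toy reader function (fake_readpage and listofurls are made up for illustration, standing in for readpage() and your real vector of page URLs):

```r
library(dplyr)
library(purrr)

# stand-in for readpage(): returns a one-row tibble per URL
fake_readpage <- function(url) tibble(url = url, n_chars = nchar(url))

listofurls <- c("page1", "page2", "page3")

# lapply() + bind_rows() ...
dfs <- lapply(listofurls, fake_readpage)
finalanswer <- bind_rows(dfs)

# ... is equivalent to a single map_dfr() call
finalanswer2 <- map_dfr(listofurls, fake_readpage)
```

Either way you end up with one combined data frame with a row (or rows) per URL.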

Upvotes: 2
