Nick

Reputation: 1

How to use purrr::possibly() with purrr::map_dfr() to continue webscraping links with rvest when encountering an error for a bad link (HTTP Error 403)

I have been trying to understand how to use possibly() to wrap a lambda/anonymous function within map_dfr() so that iteration continues when an error is encountered. I am currently iterating over a large number of webpages and scraping them with rvest, but some of them are malformed or do not load. I would simply like to record the error so that I can return to that page later, while continuing to collect data from the remaining webpages. My current code is posted below, followed by what I have tried:

df <- tibble(df, map_dfr(df$link, ~ {
  
  # Replicate Human Input by Forcing Random Pauses
  Sys.sleep(runif(1,1,3)) 
  
  # Read in the html links
  url <- .x %>% html_session(user_agent(user_agents)) %>% read_html()
  
  # Full Job Description Text
  description <- url %>% 
    html_elements(xpath = "//div[@id = 'jobDescriptionText']") %>%
    html_text() %>% tolower()
  description <- as.character(description)   
  
  # Hiring Insights
  hiring_insights <- url %>% 
    html_elements(xpath = "//div[@id = 'hiringInsightsSectionRoot']") %>% 
    html_text() %>% str_extract("#REGEX") %>% 
    str_extract("#REGEX") %>% 
    str_trim() 
  hiring_insights <- as.character(hiring_insights)
  ### Extract Number of Hires 
  hiring_insights <- str_trim(str_extract(hiring_insights,"#REGEX"))
  hiring_insights <- tolower(hiring_insights)
  ### Fill in all Missing Values with 1 
  hiring_insights[which(is.na(hiring_insights))] <- "1"
  tibble(description, hiring_insights)
}))

I have tried wrapping the lambda function a few different ways but without success:

# First Attempt
df <- tibble(df, map_dfr(df$link, possibly(~ {——}, otherwise = "error")))
# Second Attempt
df <- tibble(df, map_dfr(df$link, possibly(function(x) {——}, otherwise = "error")))
# Third Attempt 
df <- tibble(df, possibly(map_dfr(df$link, ~ {——}), otherwise = "error"))
# Fourth Attempt
df <- tibble(df, possibly(map_dfr(df$link, function(x) {——}), otherwise = "error"))

When writing the function with function(x) rather than with ~, I change .x to x inside the lambda function where the url variable is defined. However, with each of these attempts I hit a bad link and receive the HTTP 403 error, which stops the iteration and discards all of the data scraped from the previous links. What I would like is a dummy variable that notes whether or not the link was bad, and, if it was bad, to fill in the scraped variables with a placeholder or simply with whatever the otherwise argument is set to. Thank you in advance! I've really hit a wall here.

Upvotes: 0

Views: 255

Answers (1)

zephryl

Reputation: 17309

map_dfr() expects a dataframe or named vector on every iteration. Your otherwise value isn’t named, so it throws an error. To illustrate:

library(purrr)

vals <- list(1, 2, "bad", 4, 5)

map_dfr(
  vals, 
  possibly(
    ~ data.frame(x = .x^2), 
    otherwise = NA_real_
  )
)
Error in `dplyr::bind_rows()`:
! Argument 3 must have names.

But if you change otherwise to return a dataframe:

map_dfr(
  vals, 
  possibly(
    ~ data.frame(x = .x^2), 
    otherwise = data.frame(x = NA_real_)
  )
)
  x
1  1
2  4
3 NA
4 16
5 25
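
Applied to the scraping case, the same idea might look like the sketch below. It is only an illustration: scrape_one() is a hypothetical stand-in for the lambda in the question (it fails on purpose for "bad" links instead of making real requests), the links are made up, and the failed column is an assumed name for the dummy variable that was asked for. The point is that otherwise is a one-row tibble with the same columns as a successful result.

```r
library(purrr)
library(tibble)

# Hypothetical stand-in for the scraping lambda: it errors for "bad" links
# (mimicking the HTTP 403 failure) and otherwise returns a one-row tibble.
scrape_one <- function(link) {
  if (grepl("bad", link)) stop("HTTP error 403.")
  tibble(
    description     = paste("text from", link),
    hiring_insights = "1",
    failed          = FALSE
  )
}

# Fallback row: same column names and types as a successful result,
# with failed = TRUE so bad links can be revisited later.
fallback <- tibble(
  description     = NA_character_,
  hiring_insights = NA_character_,
  failed          = TRUE
)

links <- c("good-1", "bad-2", "good-3")  # made-up inputs

# possibly() wraps the function itself, not the map_dfr() call.
scraped <- map_dfr(links, possibly(scrape_one, otherwise = fallback))

# Unnamed data frame arguments to tibble() are spliced in as columns.
result <- tibble(link = links, scraped)
result
```

Note that possibly() has to wrap the function passed to map_dfr(), as in the first two attempts in the question. In the third and fourth attempts the whole map_dfr() call is handed to possibly(), so the call is evaluated (and fails) before possibly() can catch anything.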

Upvotes: 1
