Reputation: 1
I have been trying to understand how to use possibly() to wrap a lambda/anonymous function within map_dfr() so that my iterations continue should an error be encountered. I am currently iterating over a large number of web pages and using rvest to scrape them; however, some are not compiled correctly or do not work. I would simply like to note that error so that I can return to it later while continuing to collect data from the remaining web pages. My current code is posted below, along with what I've tried:
df <- tibble(df, map_dfr(df$link, ~ {
  # Replicate Human Input by Forcing Random Pauses
  Sys.sleep(runif(1, 1, 3))
  # Read in the html links
  url <- .x %>% html_session(user_agent(user_agents)) %>% read_html()
  # Full Job Description Text
  description <- url %>%
    html_elements(xpath = "//div[@id = 'jobDescriptionText']") %>%
    html_text() %>% tolower()
  description <- as.character(description)
  # Hiring Insights
  hiring_insights <- url %>%
    html_elements(xpath = "//div[@id = 'hiringInsightsSectionRoot']") %>%
    html_text() %>% str_extract("#REGEX") %>%
    str_extract("#REGEX") %>%
    str_trim()
  hiring_insights <- as.character(hiring_insights)
  ### Extract Number of Hires
  hiring_insights <- str_trim(str_extract(hiring_insights, "#REGEX"))
  hiring_insights <- tolower(hiring_insights)
  ### Fill in all Missing Values with 1
  hiring_insights[which(is.na(hiring_insights))] <- "1"
  tibble(description, hiring_insights)
}))
I have tried wrapping the lambda function a few different ways but without success:
# First Attempt
df <- tibble(df, map_dfr(df$link, possibly(~ {——}, otherwise = "error")))
# Second Attempt
df <- tibble(df, map_dfr(df$link, possibly(function(x) {——}, otherwise = "error")))
# Third Attempt
df <- tibble(df, possibly(map_dfr(df$link, ~ {——}), otherwise = "error"))
# Fourth Attempt
df <- tibble(df, possibly(map_dfr(df$link, function(x) {——}), otherwise = "error"))
When writing the function with function(x) rather than with the ~, I update the .x to x within the lambda function when defining the url variable. However, with each of these attempts I hit a bad link and receive an HTTP 403 error, which then stops the iteration and discards all of the data scraped so far. What I would like is either a dummy variable that notes whether or not the link was bad, with the scraped variables filled in with a placeholder value when it is, or simply whatever the otherwise argument is set to. Thank you in advance! I've really hit a wall here.
Upvotes: 0
Views: 255
Reputation: 17309
map_dfr() expects a dataframe or named vector on every iteration. Your otherwise value isn’t named, so it throws an error. To illustrate:
library(purrr)

vals <- list(1, 2, "bad", 4, 5)

map_dfr(
  vals,
  possibly(
    ~ data.frame(x = .x^2),
    otherwise = NA_real_
  )
)
Error in `dplyr::bind_rows()`:
! Argument 3 must have names.
But if you change otherwise to return a dataframe:
map_dfr(
  vals,
  possibly(
    ~ data.frame(x = .x^2),
    otherwise = data.frame(x = NA_real_)
  )
)
   x
1  1
2  4
3 NA
4 16
5 25
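Applied to your scraping case, the same idea might look roughly like the sketch below. It is only an illustration: scrape_one() is a hypothetical stand-in for your lambda (it just fails on any link containing "bad" instead of actually calling rvest), and the failed column and the example.com links are made up here. The point is that otherwise is a one-row tibble with the same columns the function normally returns, plus a flag marking the bad link.

```r
library(purrr)
library(tibble)

# Hypothetical stand-in for the scraping lambda: errors on "bad" links,
# otherwise returns a one-row tibble like the real scraper would.
scrape_one <- function(link) {
  if (grepl("bad", link)) stop("HTTP 403")
  tibble(
    description     = paste("text from", link),
    hiring_insights = "1",
    failed          = FALSE
  )
}

# possibly() wraps the *function* and returns a new function that
# yields `otherwise` instead of raising an error.
safe_scrape <- possibly(
  scrape_one,
  otherwise = tibble(
    description     = NA_character_,
    hiring_insights = NA_character_,
    failed          = TRUE
  )
)

# Made-up links for illustration only.
links <- c(
  "https://example.com/a",
  "https://example.com/bad",
  "https://example.com/c"
)

# One row comes back per link, in order, so the result lines up with the input.
result <- map_dfr(links, safe_scrape)
scraped <- tibble(link = links, result)
```

Because every iteration returns exactly one row, the result can be bound back onto the original links, and filtering on failed gives you the pages to revisit later. Note also that possibly() has to wrap the function passed to map_dfr(), as in your first two attempts. In the third and fourth attempts it receives the result of map_dfr(), which has already errored by that point.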
Upvotes: 1