Reputation: 235
Dear Stack Overflow users,
I am using R to scrape the profiles of a few psychotherapists from Psychology Today; this is an exercise to learn more about web scraping.
I am new to R, and I have to go through this intense training to help me with future projects. This means I might not know precisely what I am doing at the moment (e.g. I might misinterpret either the script or R's error messages), but I have to get it done. Therefore, I beg your pardon for possible misunderstandings or inaccuracies.
In short, the situation is the following. I have created a function through which I scrape information from 2 nodes of psychotherapists' profiles; the function is shown in this Stack Overflow post.
Then I create a loop in which that function is used on a few psychotherapists' profiles; the loop is in the above post as well, but I report it below because that is the part of the script that generates problems (in addition to the ones I solved in the above-mentioned post).
library(rvest) # for read_html()

j <- 1
MHP_codes <- c(150140:150180) # therapist identifiers
df_list <- vector(mode = "list", length(MHP_codes))
for (code1 in MHP_codes) {
  URL <- paste0('https://www.psychologytoday.com/us/therapists/illinois/', code1)
  # Read the HTML code from the website
  URL <- read_html(URL)
  df_list[[j]] <- tryCatch(getProfile(URL),
                           error = function(e) NA)
  j <- j + 1
}
When the loop is done, I bind the information from the different profiles into one data frame and save it:
library(plyr) # for rbind.fill()

final_df <- rbind.fill(df_list)
save(final_df, file = "final_df.Rda")
The function (getProfile) works well on individual profiles. It also works on a small range of profiles (c(150100:150150)). Please note that I do not know which psychotherapist IDs are actually assigned, so many URLs within the range do not exist.
However, generally speaking, tryCatch should handle this issue. When a URL does not exist (and thus the ID is not associated with any psychotherapist), each of the 2 nodes (and thus each of the 2 corresponding variables in my data frame) is empty (i.e. the data frame shows NAs in the corresponding cells).
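To illustrate the behaviour I expect: tryCatch returns the error handler's value when its expression throws an error, so a failed getProfile call should simply leave an NA in the corresponding list slot:
tryCatch(stop("HTTP error 404"), error = function(e) NA)
#> [1] NA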
However, in some ID ranges, two problems arise.
First, I get an error message such as the following:
Error in open.connection(x, "rb") : HTTP error 404.
So this happens despite the fact that I am using tryCatch, and despite the fact that it generally appears to work (at least until the error message appears).
Moreover, after the loop has stopped and R runs the line:
final_df <- rbind.fill(df_list)
A second message appears (this time a warning):
Warning message: In df[[var]] : closing unused connection 3 (https://www.psychologytoday.com/us/therapists/illinois/150152)
It seems there is a specific problem with that one empty URL. In fact, when I change the ID range, the loop works well despite non-existent URLs: on the one hand, when a URL exists, the information is scraped from the website; on the other hand, when a URL does not exist, the 2 variables associated with that URL (and thus with that psychotherapist ID) get an NA.
Is it possible, perhaps, to tell R to skip a URL if it is empty, without recording anything? This solution would be excellent, since it would shrink the data frame to the existing URLs, but I do not know how to do it, nor whether it would solve my problem.
Can anyone help me sort out this issue?
Upvotes: 0
Views: 732
Reputation: 235
I would like to thank @Jul for the answer. Here I post my updated loop:
library(rvest) # for read_html()

j <- 1
MHP_codes <- c(150000:150200) # therapist identifiers
df_list <- vector(mode = "list", length(MHP_codes))
for (code1 in MHP_codes) {
  delayedAssign("do.next", {next})
  URL <- paste0('https://www.psychologytoday.com/us/therapists/illinois/', code1)
  # Read the HTML code from the website
  URL <- tryCatch(read_html(URL),
                  error = function(e) force(do.next))
  df_list[[j]] <- getProfile(URL)
  j <- j + 1
}
final_df <- rbind.fill(df_list)
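One detail worth noting: iterations skipped via next never assign to df_list, so the preallocated list keeps NULL placeholders in its unused tail positions. Should rbind.fill ever complain about those, the non-data-frame entries can be filtered out first; a minimal sketch:
library(plyr) # for rbind.fill()

## Keep only the successfully scraped profiles (data frames)
df_list_ok <- Filter(is.data.frame, df_list)
final_df <- rbind.fill(df_list_ok)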
As you can see, something had to be changed: although the answer from @Jul came close to solving the problem, the loop still stopped, so I had to slightly modify the original suggestion. In particular, I introduced the following line inside the loop but outside of the tryCatch call:
delayedAssign("do.next", {next})
and, in the tryCatch call, the following error-handler expression:
force(do.next)
This is based on this other Stack Overflow post.
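For anyone unfamiliar with the trick: delayedAssign creates a promise whose expression is evaluated in the loop's own frame when it is forced, which is why next works there even though it is triggered from inside the error handler. A minimal standalone illustration (not part of the scraper):
for (i in 1:5) {
  delayedAssign("do.next", {next})       # promise holding `next`, re-created each iteration
  x <- tryCatch(
    if (i == 3) stop("simulated failure") else i,
    error = function(e) force(do.next)   # forcing the promise runs `next` in the loop's frame
  )
  print(x)                               # not reached when i == 3
}
#> [1] 1
#> [1] 2
#> [1] 4
#> [1] 5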
Upvotes: 0
Reputation: 1139
Yes, you need to wrap a tryCatch around the read_html call. This is where R tries to connect to the website, so this is where it will throw an error (as opposed to returning an empty object) if it fails to connect. You can catch that error and then use next to tell R to skip to the next iteration of the loop.
library(rvest)

## Valid URL, works fine
URL <- "https://news.bbc.co.uk"
read_html(URL)

## Invalid URL, error raised
URL <- "https://news.bbc.co.uk/not_exist"
read_html(URL)
#> Error in open.connection(x, "rb") : HTTP error 404.

## Invalid URL: catch the error and skip to the next iteration
## (meant to run inside a loop)
URL <- "https://news.bbc.co.uk/not_exist"
tryCatch({
  URL <- read_html(URL)
},
error = function(e) {
  print("URL Not Found, skipping")
  next
})
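Note that a bare next inside the handler function does not survive the function boundary (it raises "no loop for break/next"), which is why the asker needed the delayedAssign workaround above. A sketch of an alternative that avoids the issue entirely, assuming getProfile and MHP_codes from the question, is to have the handler return NULL and do the skipping in the loop body:
library(rvest)
library(plyr)

df_list <- list()
for (code1 in MHP_codes) {
  URL <- paste0('https://www.psychologytoday.com/us/therapists/illinois/', code1)
  page <- tryCatch(read_html(URL), error = function(e) NULL)  # NULL marks a failed fetch
  if (is.null(page)) next                                     # skip non-existent profiles
  df_list[[as.character(code1)]] <- getProfile(page)
}
final_df <- rbind.fill(df_list)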
Upvotes: 1