AML

Reputation: 35

Using RSelenium for web scraping across Windows and Mac

I am new to web scraping, and I have successfully written functions for several different types of sites that gather the information I need into a data frame. However, these functions were developed using RSelenium on a Mac. When I run the very same functions on my Windows PC, they fail.

I assume the issue is related to how RSelenium is set up to launch. Below is how I launch RSelenium in each of my different web scrape functions:

rs <- rsDriver(browser = "firefox", port = netstat::free_port())
remote <- rs$client
remote$navigate(url)

When this code runs inside a function with Firefox specified as the browser, I receive the following error:

Could not open firefox browser.
Client error message:
Undefined error in httr call. httr output: Failed to connect to localhost port 14415: Connection refused
Check server log for further details.
Error in checkError(res) :
 Undefined error in httr call. httr output: length(url) == 1 is not TRUE
In addition: Warning message:
In rsDriver(browser = "firefox", port = netstat::free_port()) :
Error in checkError(res) :
 Undefined error in httr call. httr output: length(url) == 1 is not TRUE

This is the error I receive when specifying Google Chrome as the browser instead:

Could not open chrome browser.
Client error message:
Undefined error in httr call. httr output: Failed to connect to localhost port 14415: Connection refused
Check server log for further details.
Error in checkError(res) :
 Undefined error in httr call. httr output: length(url) == 1 is not TRUE
In addition: Warning message:
In rsDriver(browser = "chrome", port = netstat::free_port()) :
Error in checkError(res) :
 Undefined error in httr call. httr output: length(url) == 1 is not TRUE

Again, these errors only appear in the Windows PC environment, and not on my Mac.

To try to resolve this issue, I have scoured Stack Overflow for possible answers. I have also tried a different port, 4444L for example, but to no avail: it results in the same error.

When running sessionInfo() on my Windows PC, it produces the following output:

R version 4.1.0 (2021-05-18)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19043)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                           LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] netstat_0.1.1    assertthat_0.2.1 jsonlite_1.7.2   webdriver_1.0.6  V8_3.4.2         RSelenium_1.7.7  rvest_1.0.0     
 [8] lubridate_1.7.10 forcats_0.5.1    stringr_1.4.0    dplyr_1.0.6      purrr_0.3.4      readr_1.4.0      tidyr_1.1.3     
[15] tibble_3.1.2     ggplot2_3.3.3    tidyverse_1.3.1 

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.6       binman_0.1.2     png_0.1-7        ps_1.6.0         utf8_1.2.1       showimage_1.0.0  R6_2.5.0        
 [8] cellranger_1.1.0 backports_1.2.1  reprex_2.0.0     httr_1.4.2       pillar_1.6.1     rlang_0.4.11     curl_4.3.1      
[15] readxl_1.3.1     rstudioapi_0.13  callr_3.7.0      wdman_0.2.5      munsell_0.5.0    broom_0.7.6      compiler_4.1.0  
[22] modelr_0.1.8     janitor_2.1.0    pkgconfig_2.0.3  askpass_1.1      base64enc_0.1-3  openssl_1.4.4    tidyselect_1.1.1
[29] XML_3.99-0.6     fansi_0.5.0      crayon_1.4.1     dbplyr_2.1.1     withr_2.4.2      bitops_1.0-7     grid_4.1.0      
[36] gtable_0.3.0     lifecycle_1.0.0  DBI_1.1.1        magrittr_2.0.1   semver_0.2.0     scales_1.1.1     debugme_1.1.0   
[43] cli_2.5.0        stringi_1.6.2    fs_1.5.0         snakecase_0.11.0 xml2_1.3.2       ellipsis_0.3.2   generics_0.1.0  
[50] vctrs_0.3.8      tools_4.1.0      glue_1.4.2       hms_1.1.0        processx_3.5.2   colorspace_2.0-1 caTools_1.18.2  
[57] haven_2.4.1     

How can I get RSelenium working cross-device?

Upvotes: 2

Views: 929

Answers (1)

Steffen Moritz

Reputation: 7730

In general, these issues seem to be quite common.

It is not really possible to tell from your description what the issue is in your case. Possible causes include the Java version, the browser version, the Selenium version, and so on. The reasons can be manifold, as you have probably figured out while researching the error.
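The error output itself says to check the server log, which may narrow down the cause. A minimal sketch of how that could be done, assuming rsDriver() still returned an object despite the failure and that its server element exposes a log() function returning stderr and stdout (as wdman server objects do). get_server_log is a hypothetical helper name, not part of RSelenium:

```r
# Hypothetical helper: collect the Selenium server log lines
# from the object returned by rsDriver().
# Assumption: rs$server$log() returns a list with `stderr` and `stdout`.
get_server_log <- function(rs) {
  server_log <- rs$server$log()
  c(server_log$stderr, server_log$stdout)
}

# Usage (not run here, needs RSelenium and a browser driver):
# rs <- rsDriver(browser = "firefox", port = netstat::free_port(), verbose = TRUE)
# writeLines(get_server_log(rs))
```

Messages about a missing java executable or a driver/browser version mismatch would point to the causes listed above.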

Possible Quick Fix

What you can try first is installing a recent Java Development Kit (JDK) version. Also make sure the JAVA_HOME environment variable is set, e.g.:

Sys.setenv(JAVA_HOME="C:/Program Files/Java/jdk-15.0.1/")
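Note that Sys.setenv() only affects the current R session, and the path above has to match the JDK actually installed. As a rough sketch, the following checks whether R can see a Java installation at all. check_java is a hypothetical helper name, and it only inspects JAVA_HOME and the PATH; it does not verify the Java version:

```r
# Hypothetical helper: report what R can see of the Java setup.
check_java <- function() {
  java_home <- Sys.getenv("JAVA_HOME", unset = NA)
  java_path <- unname(Sys.which("java"))  # "" when java is not on the PATH
  list(
    java_home        = java_home,
    java_home_exists = !is.na(java_home) && dir.exists(java_home),
    java_on_path     = nzchar(java_path),
    java_path        = java_path
  )
}

str(check_java())
```

If java_on_path is FALSE, the Selenium server most likely cannot be started from R, which would be consistent with the "Connection refused" message.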

I would try this as a quick fix. Otherwise, as stated, the error might also be caused by a specific combination of Java, Selenium, and browser versions. Finding a combination that fits can be cumbersome, which is why I would fall back on the following approach if the quick fix doesn't work:

Using Docker

If the quick fix does not work, you should probably switch to using Docker. This is also what the RSelenium developers more or less recommend:

Running a docker container standardises the build across OS’s and removes many of the issues user may have relating to JAVA/browser version/selenium version etc.

The package documentation explains how to use Docker with RSelenium: you install Docker, and the developers provide Docker images that are configured so that RSelenium should work without issues.
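With a container running, the R side no longer calls rsDriver() but connects to the server directly. A minimal sketch, assuming a Selenium standalone Firefox container is already running with its port 4444 published on host port 4445. The port, the browser, and the connect_selenium helper name are assumptions for illustration; check the package documentation for the images and ports to use:

```r
# Hypothetical helper: connect to an already running Selenium server
# (for example one started in a Docker container).
# Assumption: the server is reachable at localhost:4445.
connect_selenium <- function(host = "localhost", port = 4445L,
                             browser = "firefox") {
  remote <- RSelenium::remoteDriver(
    remoteServerAddr = host,
    port = port,
    browserName = browser
  )
  remote$open(silent = TRUE)
  remote
}

# Usage (not run here, needs the container to be up):
# remote <- connect_selenium()
# remote$navigate(url)
# remote$close()
```

Because the browser and Selenium server live inside the container, the same R code should behave the same way on Windows and on a Mac.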

Upvotes: 5
