Reputation: 515
I would like to scrape the results from this website, the same results a normal search on the site would return.
The code I have is below. It works, but it saves a local copy of the HTML file. I would like to turn it into a function for a package that does the same thing without saving a copy.
EDIT: It works on macOS and Linux. I just need a way to do it on Windows, because a package must work on all three operating systems.
search_cas_species <- function(species, path = getwd()) {
  url <- "https://researcharchive.calacademy.org/research/ichthyology/catalog/fishcatmain.asp"

  # Initial request to the search page before submitting the form
  page_initial <- httr::GET(url)
  content_initial <- httr::content(page_initial)

  # Wrap POST so that a failed request returns an error object instead of stopping
  POST_safe <- purrr::safely(httr::POST)

  # Form fields submitted with the search
  data_cas_species <- list(
    "tbl" = "Species",
    "contains" = species,
    "Submit" = "Search"
  )

  if (!dir.exists(path)) dir.create(path)
  species_clean <- stringr::str_replace_all(species, '[:blank:]', '_')
  html_name <- paste0(species_clean, ".html")
  html_path <- file.path(path, html_name)

  # Submit the form and write the response body to html_path
  search_page <- POST_safe(
    url = url,
    body = data_cas_species,
    encode = "form",
    httr::write_disk(html_path, overwrite = TRUE)
  )

  # Return the full path so the file can be read even when path is not getwd()
  return(html_path)
}
respostas <- search_cas_species("Cichla")

respostas %>%
  rvest::read_html() %>%
  xml2::xml_find_all(".//p[@class='result']") %>%
  `[`(-1) %>%
  `[`(c(FALSE, TRUE))
I have already tried the following, which skips write_disk(), but it gives me an error.
library(dplyr)

url <- "https://researcharchive.calacademy.org/research/ichthyology/catalog/fishcatmain.asp"
page_initial <- httr::GET(url)
content_initial <- httr::content(page_initial)
#> No encoding supplied: defaulting to UTF-8.

data_cas_species <- list(
  "tbl" = "Species",
  "contains" = "Cichla",
  "Submit" = "Search"
)

search_page <- httr::POST(
  url = url,
  body = data_cas_species,
  encode = "form"
)
#> Error in curl::curl_fetch_memory(url, handle = handle): Failure when receiving data from the peer
Created on 2022-07-12 by the reprex package (v2.0.1)
sessioninfo::platform_info()
#>  setting  value
#>  version  R version 4.1.1 (2021-08-10)
#>  os       Windows 10 x64 (build 19044)
#>  system   x86_64, mingw32
#>  ui       RTerm
#>  language (EN)
#>  collate  Portuguese_Brazil.1252
#>  ctype    Portuguese_Brazil.1252
#>  tz       America/Sao_Paulo
#>  date     2022-07-12
#>  pandoc   2.14.0.3 @ C:/Program Files/RStudio/bin/pandoc/ (via rmarkdown)
Upvotes: 2
Views: 222
Reputation: 11
Please check the R package rFishTaxa, which may meet your requirements.
devtools::install_github("Otoliths/rFishTaxa", build_vignettes = TRUE)
library("rFishTaxa")
browseVignettes('rFishTaxa')
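If you would rather keep your own function, the part about not saving a copy could look roughly like the sketch below. It is only a sketch under assumptions: the names `parse_cas_results()` and `search_cas_species_mem()` are made up for illustration, the XPath and node filtering are copied from the question, and it assumes the POST itself succeeds on your platform. It does not by itself address the curl error shown in the question.

```r
# Hypothetical helper: extract the result paragraphs from an HTML string.
parse_cas_results <- function(html_text) {
  doc <- xml2::read_html(html_text)
  nodes <- xml2::xml_find_all(doc, ".//p[@class='result']")
  # Same filtering as in the question: drop the first node,
  # then keep every second remaining node (positions 2, 4, ...).
  nodes <- nodes[-1]
  nodes[seq_along(nodes) %% 2 == 0]
}

# Hypothetical in-memory variant of search_cas_species(): nothing is written to disk.
search_cas_species_mem <- function(species) {
  url <- "https://researcharchive.calacademy.org/research/ichthyology/catalog/fishcatmain.asp"

  # Mirrors the initial request made in the question's code
  httr::GET(url)

  resp <- httr::POST(
    url = url,
    body = list(tbl = "Species", contains = species, Submit = "Search"),
    encode = "form"
  )
  # Turn HTTP error statuses into R errors
  httr::stop_for_status(resp)

  # Read the response body as text and parse it directly.
  # The UTF-8 encoding is an assumption about the page.
  html_text <- httr::content(resp, as = "text", encoding = "UTF-8")
  parse_cas_results(html_text)
}
```

Calling `search_cas_species_mem("Cichla")` would then return the filtered node set directly. Keeping the parsing in its own function also lets you test it on a saved HTML string without network access.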
Upvotes: 1