I am trying to use a script to download freshwater fish observations from this database. To start with, I am leaving all search terms blank/default. When I download the data manually I get a CSV with over 150,000 records, but when I download the data using my script below I only get ~100,000 records.
Comparing the two datasets, there is a column called dataVersion which, in the manually downloaded dataset, has three possible values (V1, V2, V3), whereas the dataset downloaded with my script only includes dataVersion values that equal V1.
But I cannot see where I should be specifying that I want all entries for dataVersion, as this isn't one of the available search fields. I'm not sure if this is an issue with my code, or this is intended behaviour of the search portal.
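One way to make the comparison concrete is to list the dataVersion values that appear in one download but not the other. The sketch below uses toy stand-ins for the two data frames; missing_versions is a hypothetical helper name introduced here for illustration, and the values are not real records.

```r
# Hypothetical helper: values of `column` present in `reference`
# but absent from `candidate`.
missing_versions <- function(reference, candidate, column = "dataVersion") {
  setdiff(unique(reference[[column]]), unique(candidate[[column]]))
}

# Toy stand-ins for the two downloads (illustrative values only)
manual   <- data.frame(dataVersion = c("V1", "V2", "V3", "V1"))
scripted <- data.frame(dataVersion = c("V1", "V1"))

table(manual$dataVersion)            # counts per version in the manual download
missing_versions(manual, scripted)   # versions the scripted download lacks
```

With the real downloads, the same call would take the manually downloaded data frame as the reference and the scripted one as the candidate.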
Here is my code as it stands (this downloads ~100,000 records). It should run in less than 2 minutes.
# get web html info
get_doc <- function() {
  gr <- httr::GET("https://nzffdms.niwa.co.nz/search")
  xml2::read_html(httr::content(gr, "text"))
}
# get csrf_token
get_tok <- function() {
  xml2::xml_attr(xml2::xml_find_all(
    get_doc(),
    ".//input[@name='sample_search[_token]']"
  ), "value")
}
# compile search terms
taxon <- ""
names(taxon) <- "sample_search[taxon][]"
search_terms <- c(list(
  "sample_search[organisation]" = "",
  "sample_search[catchment_no_name]" = "",
  "sample_search[catchment_name]" = "",
  "sample_search[water_body]" = "",
  "sample_search[sample_method]" = "",
  "sample_search[start_year]" = "1850",
  "sample_search[end_year]" = "2100",
  "sample_search[download_format]" = "cde",
  "sample_search[_token]" = get_tok()), taxon)
# run search
r <- httr::POST("https://nzffdms.niwa.co.nz/search",
                body = search_terms,
                encode = "form")
# convert to dataframe
res <- utils::read.csv(text = httr::content(r, "text", encoding = "UTF-8"))
If anyone has any suggestions for how I can get the full ~150,000 records to download via my script that would be much appreciated!
Reputation: 8567
The problem is that the parameters you pass manually to the POST request are not the same as those passed in the browser.
In the request the browser sends for a default search, all values are "" except for the download format. In your code, you pass start_year = 1850 and end_year = 2100. If we fix search_terms to match exactly what is passed in the browser, we get the correct number of rows:
# get web html info
get_doc <- function() {
  gr <- httr::GET("https://nzffdms.niwa.co.nz/search")
  xml2::read_html(httr::content(gr, "text"))
}
# get csrf_token
get_tok <- function() {
  xml2::xml_attr(xml2::xml_find_all(
    get_doc(),
    ".//input[@name='sample_search[_token]']"
  ), "value")
}
# compile search terms
search_terms <- list(
  "sample_search[organisation]" = "",
  "sample_search[catchment_no_name]" = "",
  "sample_search[catchment_name]" = "",
  "sample_search[water_body]" = "",
  "sample_search[sample_method]" = "",
  "sample_search[start_year]" = "",
  "sample_search[end_year]" = "",
  "sample_search[download_format]" = "cde",
  "sample_search[submit]" = "",
  "sample_search[_token]" = get_tok())
# run search
r <- httr::POST("https://nzffdms.niwa.co.nz/search",
                body = search_terms,
                encode = "form")
# convert to dataframe
res <- utils::read.csv(text = httr::content(r, "text", encoding = "UTF-8"))
nrow(res)
#> [1] 154723
head(res)
#>   nzffdRecordNumber  eventDate eventTime institution       waterBody
#> 1                 1 1979-06-05     10:30        NIWA Limestone Creek
#> 2                 1 1979-06-05     10:30        NIWA Limestone Creek
#> 3                 1 1979-06-05     10:30        NIWA Limestone Creek
#> 4                 1 1979-06-05     10:30        NIWA Limestone Creek
#> 5                 1 1979-06-05     10:30        NIWA Limestone Creek
#> 6                 1 1979-06-05     10:30        NIWA Limestone Creek
#>   waterBodyType site catchmentNumber catchmentName eastingNZTM northingNZTM
#> 1   Not Entered              691.021       Hinds R     1463229      5157184
#> 2   Not Entered              691.021       Hinds R     1463229      5157184
#> 3   Not Entered              691.021       Hinds R     1463229      5157184
#> 4   Not Entered              691.021       Hinds R     1463229      5157184
#> 5   Not Entered              691.021       Hinds R     1463229      5157184
#> 6   Not Entered              691.021       Hinds R     1463229      5157184
#>   minimumElevation distanceOcean                  samplingMethod
#> 1              480            60 Electric fishing - Type unknown
#> 2              480            60 Electric fishing - Type unknown
#> 3              480            60 Electric fishing - Type unknown
#> 4              480            60 Electric fishing - Type unknown
#> 5              480            60 Electric fishing - Type unknown
#> 6              480            60 Electric fishing - Type unknown
#>   samplingProtocol              taxonName     taxonCommonName totalCount
#> 1          Unknown   Galaxias brevipinnis               Koaro         NA
#> 2          Unknown      Galaxias vulgaris Canterbury galaxias         NA
#> 3          Unknown      Carassius auratus            Goldfish         NA
#> 4          Unknown     Galaxias maculatus              Inanga         NA
#> 5          Unknown Gobiomorphus breviceps        Upland bully         NA
#> 6          Unknown  Salvelinus fontinalis          Brook char         NA
#>   present soughtNotDetected minLength maxLength dataVersion
#> 1    true             false        NA        NA          V1
#> 2    true             false        NA        NA          V1
#> 3    true             false        NA        NA          V1
#> 4    true             false        NA        NA          V1
#> 5    true             false        NA        NA          V1
#> 6    true             false        NA        NA          V1
Created on 2022-10-03 with reprex v2.0.2
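To see every place where a scripted request departs from what the browser sends, the two parameter lists can be compared field by field. The following is only a sketch: compare_terms is a hypothetical helper written for this illustration, the browser fields are copied from the working search_terms above, and the token is replaced by a placeholder because it changes on every page load.

```r
# Fields sent by the browser for a default search (from the working code above).
# The token is a placeholder here; it is ignored in the comparison.
browser_terms <- list(
  "sample_search[organisation]" = "",
  "sample_search[catchment_no_name]" = "",
  "sample_search[catchment_name]" = "",
  "sample_search[water_body]" = "",
  "sample_search[sample_method]" = "",
  "sample_search[start_year]" = "",
  "sample_search[end_year]" = "",
  "sample_search[download_format]" = "cde",
  "sample_search[submit]" = "",
  "sample_search[_token]" = "placeholder"
)

# Fields from the question's script (token again a placeholder).
question_terms <- c(browser_terms[1:5], list(
  "sample_search[start_year]" = "1850",
  "sample_search[end_year]" = "2100",
  "sample_search[download_format]" = "cde",
  "sample_search[_token]" = "placeholder",
  "sample_search[taxon][]" = ""
))

# Hypothetical helper: report fields that are missing from `mine`,
# extra in `mine`, or present in both with different values.
compare_terms <- function(mine, reference, ignore = "sample_search[_token]") {
  ref_names  <- setdiff(names(reference), ignore)
  mine_names <- setdiff(names(mine), ignore)
  shared <- intersect(ref_names, mine_names)
  same <- vapply(shared,
                 function(n) identical(mine[[n]], reference[[n]]),
                 logical(1))
  list(
    missing = setdiff(ref_names, mine_names),
    extra   = setdiff(mine_names, ref_names),
    differs = shared[!same]
  )
}

compare_terms(question_terms, browser_terms)
```

For the question's script this reports the two year fields as differing, sample_search[submit] as missing, and sample_search[taxon][] as extra, so any of these could be checked individually if the row count is still off.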