flee

Reputation: 1335

Mismatch in results from POST and manual website download

I am trying to use a script to download freshwater fish observations from this database. To start with, I am leaving all search terms blank or at their defaults. When I download the data manually I get a CSV with over 150,000 records, but when I download the data using my script below I only get ~100,000 records.

Comparing the two datasets, there is a column called dataVersion which has three possible values (V1, V2, V3) in the manually downloaded dataset, whereas the dataset downloaded with my script only contains rows where dataVersion equals V1.

But I cannot see where I should be specifying that I want all entries for dataVersion, as this isn't one of the available search fields. I'm not sure if this is an issue with my code, or this is intended behaviour of the search portal.

Here is my code as it stands (this downloads ~100,000 records). It should run in less than 2 minutes.

# get web html info
get_doc <- function() {
  gr <- httr::GET("https://nzffdms.niwa.co.nz/search")
  xml2::read_html(httr::content(gr, "text"))
}

# get csrf_token
get_tok <- function() {
  xml2::xml_attr(xml2::xml_find_all(
    get_doc(),
    ".//input[@name='sample_search[_token]']"
  ), "value")
}

# compile search terms
taxon <- ""
names(taxon) <- "sample_search[taxon][]"

search_terms <- c(list(
  "sample_search[organisation]" = "",
  "sample_search[catchment_no_name]" = "",
  "sample_search[catchment_name]" = "",
  "sample_search[water_body]" = "",
  "sample_search[sample_method]" = "",
  "sample_search[start_year]" = "1850",
  "sample_search[end_year]" = "2100",
  "sample_search[download_format]" = "cde",
  "sample_search[_token]" = get_tok()), taxon)


# run search
r <- httr::POST("https://nzffdms.niwa.co.nz/search",
                body = search_terms,
                encode = "form")


# convert to dataframe
res <- utils::read.csv(text = httr::content(r, "text", encoding = "UTF-8"))

If anyone has any suggestions for how I can get the full ~150,000 records to download via my script that would be much appreciated!

Upvotes: 1

Views: 73

Answers (1)

bretauv

Reputation: 8567

The problem is that the parameters you pass manually in the POST request are not the same as those sent by the browser.
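One way to spot this kind of mismatch is to diff the two sets of form fields. The following is only a minimal sketch: `script_params` is copied from the question's code, `browser_params` is assumed to match the corrected `search_terms` further down in this answer, and the CSRF token is left out of both because it changes per session. The variable names are made up for illustration.

```r
# Form fields sent by the question's script (token omitted)
script_params <- list(
  "sample_search[organisation]"      = "",
  "sample_search[catchment_no_name]" = "",
  "sample_search[catchment_name]"    = "",
  "sample_search[water_body]"        = "",
  "sample_search[sample_method]"     = "",
  "sample_search[start_year]"        = "1850",
  "sample_search[end_year]"          = "2100",
  "sample_search[download_format]"   = "cde",
  "sample_search[taxon][]"           = ""
)

# Form fields assumed to be sent by the browser (token omitted)
browser_params <- list(
  "sample_search[organisation]"      = "",
  "sample_search[catchment_no_name]" = "",
  "sample_search[catchment_name]"    = "",
  "sample_search[water_body]"        = "",
  "sample_search[sample_method]"     = "",
  "sample_search[start_year]"        = "",
  "sample_search[end_year]"          = "",
  "sample_search[download_format]"   = "cde",
  "sample_search[submit]"            = ""
)

# Fields present on one side only
only_in_script  <- setdiff(names(script_params), names(browser_params))
only_in_browser <- setdiff(names(browser_params), names(script_params))

# Fields present on both sides but with different values
shared  <- intersect(names(script_params), names(browser_params))
same    <- mapply(identical, script_params[shared], browser_params[shared])
differs <- shared[!same]

only_in_script
only_in_browser
differs
```

Here `differs` picks out the two year fields, which is the difference discussed below.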

Below are the parameters passed in the browser:

[Screenshot: form data sent by the browser's POST request]

As you can see, every value is "" except for the download format (and the CSRF token). In your code, you pass start_year = 1850 and end_year = 2100, and you leave out the submit field. If we fix search_terms to match exactly what is passed in the browser, we get the correct number of rows:

# get web html info
get_doc <- function() {
  gr <- httr::GET("https://nzffdms.niwa.co.nz/search")
  xml2::read_html(httr::content(gr, "text"))
}

# get csrf_token
get_tok <- function() {
  xml2::xml_attr(xml2::xml_find_all(
    get_doc(),
    ".//input[@name='sample_search[_token]']"
  ), "value")
}

# compile search terms
search_terms <- list(
  "sample_search[organisation]" = "",
  "sample_search[catchment_no_name]" = "",
  "sample_search[catchment_name]" = "",
  "sample_search[water_body]" = "",
  "sample_search[sample_method]" = "",
  "sample_search[start_year]" = "",
  "sample_search[end_year]" = "",
  "sample_search[download_format]" = "cde",
  "sample_search[submit]" = "",
  "sample_search[_token]" = get_tok())


# run search
r <- httr::POST("https://nzffdms.niwa.co.nz/search",
                body = search_terms,
                encode = "form")


# convert to dataframe
res <- utils::read.csv(text = httr::content(r, "text", encoding = "UTF-8"))

nrow(res)
#> [1] 154723
head(res)
#>   nzffdRecordNumber  eventDate eventTime institution       waterBody
#> 1                 1 1979-06-05     10:30        NIWA Limestone Creek
#> 2                 1 1979-06-05     10:30        NIWA Limestone Creek
#> 3                 1 1979-06-05     10:30        NIWA Limestone Creek
#> 4                 1 1979-06-05     10:30        NIWA Limestone Creek
#> 5                 1 1979-06-05     10:30        NIWA Limestone Creek
#> 6                 1 1979-06-05     10:30        NIWA Limestone Creek
#>   waterBodyType site catchmentNumber catchmentName eastingNZTM northingNZTM
#> 1   Not Entered              691.021       Hinds R     1463229      5157184
#> 2   Not Entered              691.021       Hinds R     1463229      5157184
#> 3   Not Entered              691.021       Hinds R     1463229      5157184
#> 4   Not Entered              691.021       Hinds R     1463229      5157184
#> 5   Not Entered              691.021       Hinds R     1463229      5157184
#> 6   Not Entered              691.021       Hinds R     1463229      5157184
#>   minimumElevation distanceOcean                  samplingMethod
#> 1              480            60 Electric fishing - Type unknown
#> 2              480            60 Electric fishing - Type unknown
#> 3              480            60 Electric fishing - Type unknown
#> 4              480            60 Electric fishing - Type unknown
#> 5              480            60 Electric fishing - Type unknown
#> 6              480            60 Electric fishing - Type unknown
#>   samplingProtocol              taxonName     taxonCommonName totalCount
#> 1          Unknown   Galaxias brevipinnis               Koaro         NA
#> 2          Unknown      Galaxias vulgaris Canterbury galaxias         NA
#> 3          Unknown      Carassius auratus            Goldfish         NA
#> 4          Unknown     Galaxias maculatus              Inanga         NA
#> 5          Unknown Gobiomorphus breviceps        Upland bully         NA
#> 6          Unknown  Salvelinus fontinalis          Brook char         NA
#>   present soughtNotDetected minLength maxLength dataVersion
#> 1    true             false        NA        NA          V1
#> 2    true             false        NA        NA          V1
#> 3    true             false        NA        NA          V1
#> 4    true             false        NA        NA          V1
#> 5    true             false        NA        NA          V1
#> 6    true             false        NA        NA          V1

Created on 2022-10-03 with reprex v2.0.2
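Since head(res) only shows V1 rows, it may be worth confirming that the other versions are actually present. Below is a minimal sketch of such a check. `res_demo` is a toy stand-in for the real `res`, and `missing_versions()` is a hypothetical helper, not part of any package. Only the column name dataVersion and the values V1, V2, V3 come from the question.

```r
# Toy stand-in for the downloaded data frame `res`
res_demo <- data.frame(
  nzffdRecordNumber = c(1, 2, 3, 4),
  dataVersion       = c("V1", "V2", "V3", "V1")
)

# Hypothetical helper: which expected versions are absent from a download?
missing_versions <- function(df, expected = c("V1", "V2", "V3")) {
  setdiff(expected, unique(df$dataVersion))
}

# Row counts per version
table(res_demo$dataVersion)

# Empty result means every expected version is present
missing_versions(res_demo)

# A V1-only download, like the one in the question, reports V2 and V3 missing
missing_versions(res_demo[res_demo$dataVersion == "V1", ])
```

On the real data, `table(res$dataVersion)` and `missing_versions(res)` would be the calls to run.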

Upvotes: 1
