user3

Reputation: 37

dbhydroR How to Download Information for Multiple Variables, All Stations, All Years

I am trying to download datasets for multiple variables, for all active and inactive groundwater wells and for groundwater quality, in the South Florida Water Management District for all years on record, using the dbhydroR package in RStudio. Some of the variables come from the hydrological and physical groundwater data, and others come from the groundwater quality data.

What code can be used to bulk download this information?

Upvotes: 1

Views: 71

Answers (1)

jsta

Reputation: 3393

tl;dr: You can do this in principle, but not without a lot of manual intervention.

For hydro data, you could pass every site on the station map (https://www.sfwmd.gov/sites/default/files/documents/em_well_monitor_map.pdf) to get_hydro. To do this, set the stationid argument to a very long vector of station ids instead of the two sites shown below:

get_hydro(stationid = c("C-54", "G-561"), 
    category = "GW", freq = "DA", 
    date_min = "1990-01-01", date_max = "1990-02-02", 
    longest = TRUE)
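
One way to keep such a bulk request manageable is to split the station vector into batches and make one call per batch, so that a single failure does not lose everything. The following is a minimal sketch and not part of dbhydroR: fetch_in_batches, its fetch argument, and the batch size are hypothetical, and the station ids "BAD-1", "X-1", and "X-2" are made up. The fetch function is passed in so that the batching logic can be tried offline first.

```r
# Split a long vector of station ids into batches and call `fetch` once per
# batch. A batch that errors (or returns NULL) is recorded as failed instead
# of aborting the whole run.
# `fetch_in_batches` is a hypothetical helper, not a dbhydroR function.
fetch_in_batches <- function(stationids, fetch, batch_size = 25) {
  batch_index <- ceiling(seq_along(stationids) / batch_size)
  batches <- split(stationids, batch_index)
  results <- lapply(batches, function(ids) {
    tryCatch(fetch(ids), error = function(e) {
      message("Batch failed: ", conditionMessage(e))
      NULL
    })
  })
  failed <- vapply(results, is.null, logical(1))
  list(
    data = unname(results[!failed]),
    failed_ids = as.character(unlist(batches[failed], use.names = FALSE))
  )
}

# Offline stand-in for a real request: fails on one made-up station id.
fake_fetch <- function(ids) {
  if ("BAD-1" %in% ids) stop("no data returned")
  data.frame(station = ids, stringsAsFactors = FALSE)
}

out <- fetch_in_batches(c("C-54", "G-561", "BAD-1", "X-1", "X-2"),
                        fake_fetch, batch_size = 2)
out$failed_ids

# With dbhydroR loaded, `fetch` could wrap the call shown above, e.g.
# function(ids) get_hydro(stationid = ids, category = "GW", freq = "DA",
#                         date_min = "1990-01-01", date_max = "1990-02-02")
```

The failed ids can then be retried one at a time to find the stations that actually cause the problem.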

For water quality data, you could pass every site on the station map (http://my.sfwmd.gov/WAB/EnvironmentalMonitoring/index.html), together with every test name in the table at http://my.sfwmd.gov/dbhydroplsql/show_dbkey_info.show_data_type_info, to the get_wq function by expanding the vectors passed to the station_id and test_name arguments:

get_wq(station_id = c("FLAB08", "FLAB09"),
    date_min = "2011-03-01", date_max = "2012-05-01",
    test_name = c("CHLOROPHYLLA-SALINE", "SALINITY"))

Besides the possibility that such a query may break either the dbhydro servers or your local machine, there are several reasons why it may not return the result you expect. I'm afraid the underlying database is much too messy to write code that gets all variables for a given station through time in an unsupervised way. One issue is that there is no canonical dataset for variable X at site Y; instead, the record for each variable at each site changes through time.

Hydro data

For hydro sites, the data record is often represented by multiple datasets, each with a different id (a dbkey, in the case of hydro data). These ids can change for any number of reasons, perhaps a change in sampling or laboratory protocol. The time periods of these datasets can overlap or leave gaps. For example, see the output of:

get_dbkey(stationid = c("C-54"), 
    category = "GW", freq = "DA")

  Dbkey Group Data Type Freq Recorder  Start Date    End Date
1 P0916  C-54      WELL   DA     MOD1 01-JAN-1978 31-DEC-2013
2 01952  C-54      WELL   DA     ???? 10-FEB-1951 20-DEC-2016
3 05669  C-54      WELL   DA     ???? 23-NOV-1976 11-APR-1977
4 06584  C-54      WELL   DA     ???? 31-MAR-1977 03-OCT-1978
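
To see which of these records overlap, the start and end dates can be compared pairwise. This is a rough base-R sketch: the four rows are retyped by hand from the output above, and the column names (dbkey, start, end) and the helper parse_dmy are assumptions of the sketch, not the names dbhydroR returns.

```r
# Period-of-record table for station C-54, copied from the get_dbkey output
# above. The column names (dbkey, start, end) are assumptions of this sketch.
por <- data.frame(
  dbkey = c("P0916", "01952", "05669", "06584"),
  start = c("01-JAN-1978", "10-FEB-1951", "23-NOV-1976", "31-MAR-1977"),
  end   = c("31-DEC-2013", "20-DEC-2016", "11-APR-1977", "03-OCT-1978"),
  stringsAsFactors = FALSE
)

# Parse "DD-MON-YYYY" strings without relying on the session locale.
parse_dmy <- function(x) {
  parts <- strsplit(x, "-", fixed = TRUE)
  day   <- vapply(parts, `[`, character(1), 1)
  month <- match(toupper(vapply(parts, `[`, character(1), 2)),
                 toupper(month.abb))
  year  <- vapply(parts, `[`, character(1), 3)
  as.Date(sprintf("%s-%02d-%s", year, month, day), format = "%Y-%m-%d")
}

por$start <- parse_dmy(por$start)
por$end   <- parse_dmy(por$end)

# Two periods overlap when each one starts before the other one ends.
pairs <- t(combn(nrow(por), 2))
a <- pairs[, 1]
b <- pairs[, 2]
overlap_table <- data.frame(
  dbkey_a  = por$dbkey[a],
  dbkey_b  = por$dbkey[b],
  overlaps = por$start[a] <= por$end[b] & por$start[b] <= por$end[a],
  stringsAsFactors = FALSE
)
overlap_table
```

Pairs flagged FALSE are separated by a gap in time, so neither record can simply stand in for the other.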

dbhydroR does not have sufficient logic to select the "correct" dbkey or to concatenate the results of several "correct" dbkeys. The selection can be roughly automated by using the longest argument, which picks the longest period-of-record dataset for each variable x site combination (the longest argument is new and is not on CRAN; it is only on GitHub for now):

get_dbkey(stationid = c("C-54"), 
    category = "GW", freq = "DA", longest = TRUE)

  Dbkey Group Data Type Freq Recorder  Start Date    End Date
2 01952  C-54      WELL   DA     ???? 10-FEB-1951 20-DEC-2016
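
To make the selection rule concrete, here is a base-R sketch of picking the longest period of record per station from the four C-54 rows listed earlier. It only illustrates the idea and is not claimed to be how dbhydroR implements the longest argument; the column names are assumptions, and the dates are the ones from the get_dbkey output rewritten in ISO format.

```r
# The four C-54 records from the get_dbkey output, with dates in ISO format.
# Column names are assumptions of this sketch, not dbhydroR's own.
por <- data.frame(
  station = "C-54",
  dbkey = c("P0916", "01952", "05669", "06584"),
  start = as.Date(c("1978-01-01", "1951-02-10", "1976-11-23", "1977-03-31")),
  end   = as.Date(c("2013-12-31", "2016-12-20", "1977-04-11", "1978-10-03")),
  stringsAsFactors = FALSE
)

# Length of each record in days.
por$days <- as.numeric(difftime(por$end, por$start, units = "days"))

# Keep the row with the most days within each station.
longest <- do.call(rbind, lapply(split(por, por$station), function(d) {
  d[which.max(d$days), ]
}))
longest$dbkey
```

Note that the longest record is not necessarily the most complete one: this rule ignores gaps inside a record and discards whatever the shorter records cover.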

Water quality data

Similar messiness issues occur in the water quality data. There are no cleaned, ready-to-go (curated) records of each variable at each site. Instead, the period of record gets split each time a sampling or measurement protocol changes. For example, look at how chlorophyll-A is represented by three different test names. Frustratingly, there is no way to tell what the period of record is for this data until you attempt to pull it. You would also have to decide how to deal with quality assurance flags.
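
Since the period of record is only known after a pull, one option is to probe each station and test-name combination separately and note the date range of whatever comes back. The sketch below is an assumption-laden illustration: probe_por and its fetch argument are hypothetical, fetch is assumed to return a data frame with a date column, and the stand-in fake_wq returns invented dates purely so the loop can be run offline.

```r
# Try every station x test-name combination on its own and record the first
# and last date returned. Combinations that error or return nothing keep NA.
# `probe_por` is a hypothetical helper, not a dbhydroR function.
probe_por <- function(stations, tests, fetch) {
  combos <- expand.grid(station = stations, test = tests,
                        stringsAsFactors = FALSE)
  combos$first_date <- as.Date(NA)
  combos$last_date  <- as.Date(NA)
  for (i in seq_len(nrow(combos))) {
    res <- tryCatch(fetch(combos$station[i], combos$test[i]),
                    error = function(e) NULL)
    if (!is.null(res) && nrow(res) > 0) {
      combos$first_date[i] <- min(res$date)
      combos$last_date[i]  <- max(res$date)
    }
  }
  combos
}

# Offline stand-in with invented dates; it is NOT real dbhydro data.
fake_wq <- function(station, test) {
  if (station == "FLAB08" && test == "SALINITY") {
    data.frame(date = as.Date(c("2011-03-15", "2012-04-20")),
               value = c(1, 2))
  } else {
    stop("no data")
  }
}

probe <- probe_por(c("FLAB08", "FLAB09"),
                   c("CHLOROPHYLLA-SALINE", "SALINITY"),
                   fake_wq)
probe
```

To use this against the real database, fetch would wrap get_wq with a single station_id and test_name and a wide date_min/date_max window, and would need to reshape the result so that it has a date column; the exact shape of the get_wq output is not assumed here.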

Upvotes: 1
