Reputation: 1367
I ran into an issue where mutate()
in dplyr returns results in the wrong order. My call to mutate
uses data from an existing column as input, but the results are returned arranged as if the data was sorted before the mutate
.
My specific issue uses the dataRetrieval
package to get USGS/NWIS data from the web. In this example, I am retrieving the site names based on the site ID. In the `dataRetreival package, site ID is a numerical code stored as a character.
library(dataRetrieval)
library(dplyr)
Gauges <- tibble( Name = c("Twisp", "Chewuch", "Andrews" ,"Met@Winthrop", "Met@Twisp", "Met@Pateros", "Met@Goat"),
ID = c("12448998" , "12448000","12447390", "12448500" ,"12449500","12449950" , "12447383")
)
## This works correctly with each of the station numbers
readNWISsite(Gauges$ID[1])$station_nm
# [1] "TWISP RIVER NEAR TWISP, WA"
## This does not work correctly
## Order is not right! Station does not correspond with ID !!
Gauges%>%
mutate(Station = readNWISsite(ID)$station_nm)
# # A tibble: 7 x 3
# Name ID Station
# <chr> <chr> <chr>
# 1 Twisp 12448998 METHOW RIVER ABOVE GOAT CREEK NEAR MAZAMA, WA
# 2 Chewuch 12448000 ANDREWS CREEK NEAR MAZAMA, WA
# 3 Andrews 12447390 CHEWUCH RIVER AT WINTHROP, WA
# 4 Met@Winthrop 12448500 METHOW RIVER AT WINTHROP, WA
# 5 Met@Twisp 12449500 TWISP RIVER NEAR TWISP, WA
# 6 Met@Pateros 12449950 METHOW RIVER AT TWISP, WA
# 7 Met@Goat 12447383 METHOW RIVER NEAR PATEROS, WA
## This works, returning the correct site associated with the gauge number
Gauges%>%
arrange(ID) %>%
mutate(Station = readNWISsite(ID)$station_nm)
# # A tibble: 7 x 3
# Name ID Station
# <chr> <chr> <chr>
# 1 Met@Goat 12447383 METHOW RIVER ABOVE GOAT CREEK NEAR MAZAMA, WA
# 2 Andrews 12447390 ANDREWS CREEK NEAR MAZAMA, WA
# 3 Chewuch 12448000 CHEWUCH RIVER AT WINTHROP, WA
# 4 Met@Winthrop 12448500 METHOW RIVER AT WINTHROP, WA
# 5 Twisp 12448998 TWISP RIVER NEAR TWISP, WA
# 6 Met@Twisp 12449500 METHOW RIVER AT TWISP, WA
# 7 Met@Pateros 12449950 METHOW RIVER NEAR PATEROS, WA
Why is mutate
rearranging the data in the middle of the process? Alternatively, what is going on here?
Upvotes: 3
Views: 193
Reputation: 887961
To understand what is happening, instead of extracting only the 'station_nm', get the 'site_no' as well
library(dplyr)
library(dataRetrieval)
readNWISsite(Gauges$ID)[c('site_no', 'station_nm')]
#site_no station_nm
#1 12447383 METHOW RIVER ABOVE GOAT CREEK NEAR MAZAMA, WA
#2 12447390 ANDREWS CREEK NEAR MAZAMA, WA
#3 12448000 CHEWUCH RIVER AT WINTHROP, WA
#4 12448500 METHOW RIVER AT WINTHROP, WA
#5 12448998 TWISP RIVER NEAR TWISP, WA
#6 12449500 METHOW RIVER AT TWISP, WA
#7 12449950 METHOW RIVER NEAR PATEROS, WA
Here, the 'site_no' is ordered based on the integer values of 'ID'. To correct this, we can either apply the function on each 'ID' one at a time with rowwise
Gauges %>%
rowwise() %>%
mutate(Station = readNWISsite(ID)$station_nm)
or map
from purrr
library(purrr)
Gauges %>%
mutate(Station = map_chr(ID, ~ readNWISsite(.x)$station_nm))
# A tibble: 7 x 3
# Name ID Station
# <chr> <chr> <chr>
#1 Twisp 12448998 TWISP RIVER NEAR TWISP, WA
#2 Chewuch 12448000 CHEWUCH RIVER AT WINTHROP, WA
#3 Andrews 12447390 ANDREWS CREEK NEAR MAZAMA, WA
#4 Met@Winthrop 12448500 METHOW RIVER AT WINTHROP, WA
#5 Met@Twisp 12449500 METHOW RIVER AT TWISP, WA
#6 Met@Pateros 12449950 METHOW RIVER NEAR PATEROS, WA
#7 Met@Goat 12447383 METHOW RIVER ABOVE GOAT CREEK NEAR MAZAMA, WA
Or we extract both columns and do a match
with the 'ID' and 'site_no'
Gauges %>%
mutate(Station = {
tmp <- readNWISsite(ID)[c('site_no', 'station_nm')]
tmp$station_nm[match(ID, tmp$site_no)]})
# A tibble: 7 x 3
# Name ID Station
# <chr> <chr> <chr>
#1 Twisp 12448998 TWISP RIVER NEAR TWISP, WA
#2 Chewuch 12448000 CHEWUCH RIVER AT WINTHROP, WA
#3 Andrews 12447390 ANDREWS CREEK NEAR MAZAMA, WA
#4 Met@Winthrop 12448500 METHOW RIVER AT WINTHROP, WA
#5 Met@Twisp 12449500 METHOW RIVER AT TWISP, WA
#6 Met@Pateros 12449950 METHOW RIVER NEAR PATEROS, WA
#7 Met@Goat 12447383 METHOW RIVER ABOVE GOAT CREEK NEAR MAZAMA, WA
Upvotes: 4