Brian Fisher
Brian Fisher

Reputation: 1367

Mutate returns data in wrong order dplyr is this a bug?

I ran into an issue where mutate() in dplyr returns results in the wrong order. My call to mutate uses data from an existing column as input, but the results are returned arranged as if the data was sorted before the mutate.

My specific issue uses the dataRetrieval package to get USGS/NWIS data from the web. In this example, I am retrieving the site names based on the site ID. In the `dataRetreival package, site ID is a numerical code stored as a character.

library(dataRetrieval)
library(dplyr)

Gauges <- tibble( Name = c("Twisp", "Chewuch", "Andrews" ,"Met@Winthrop", "Met@Twisp", "Met@Pateros", "Met@Goat"),
                  ID = c("12448998" , "12448000","12447390", "12448500" ,"12449500","12449950" , "12447383")
)

## This works correctly with each of the station numbers
readNWISsite(Gauges$ID[1])$station_nm
# [1] "TWISP RIVER NEAR TWISP, WA"

## This does not work correctly
## Order is not right! Station does not correspond with ID  !!
Gauges%>%
      mutate(Station = readNWISsite(ID)$station_nm)

# # A tibble: 7 x 3
# Name         ID       Station                                      
# <chr>        <chr>    <chr>                                        
# 1 Twisp        12448998 METHOW RIVER ABOVE GOAT CREEK NEAR MAZAMA, WA
# 2 Chewuch      12448000 ANDREWS CREEK NEAR MAZAMA, WA                
# 3 Andrews      12447390 CHEWUCH RIVER AT WINTHROP, WA                
# 4 Met@Winthrop 12448500 METHOW RIVER AT WINTHROP, WA                 
# 5 Met@Twisp    12449500 TWISP RIVER NEAR TWISP, WA                   
# 6 Met@Pateros  12449950 METHOW RIVER AT TWISP, WA                    
# 7 Met@Goat     12447383 METHOW RIVER NEAR PATEROS, WA    

## This works, returning the correct site associated with the gauge number
Gauges%>%
      arrange(ID) %>%
      mutate(Station = readNWISsite(ID)$station_nm)
# # A tibble: 7 x 3
# Name         ID       Station                                      
# <chr>        <chr>    <chr>                                        
# 1 Met@Goat     12447383 METHOW RIVER ABOVE GOAT CREEK NEAR MAZAMA, WA
# 2 Andrews      12447390 ANDREWS CREEK NEAR MAZAMA, WA                
# 3 Chewuch      12448000 CHEWUCH RIVER AT WINTHROP, WA                
# 4 Met@Winthrop 12448500 METHOW RIVER AT WINTHROP, WA                 
# 5 Twisp        12448998 TWISP RIVER NEAR TWISP, WA                   
# 6 Met@Twisp    12449500 METHOW RIVER AT TWISP, WA                    
# 7 Met@Pateros  12449950 METHOW RIVER NEAR PATEROS, WA  

Why is mutate rearranging the data in the middle of the process? Alternatively, what is going on here?

Upvotes: 3

Views: 193

Answers (1)

akrun
akrun

Reputation: 887961

To understand what is happening, instead of extracting only the 'station_nm', get the 'site_no' as well

library(dplyr)
library(dataRetrieval)
readNWISsite(Gauges$ID)[c('site_no', 'station_nm')]
#site_no                                    station_nm
#1 12447383 METHOW RIVER ABOVE GOAT CREEK NEAR MAZAMA, WA
#2 12447390                 ANDREWS CREEK NEAR MAZAMA, WA
#3 12448000                 CHEWUCH RIVER AT WINTHROP, WA
#4 12448500                  METHOW RIVER AT WINTHROP, WA
#5 12448998                    TWISP RIVER NEAR TWISP, WA
#6 12449500                     METHOW RIVER AT TWISP, WA
#7 12449950                 METHOW RIVER NEAR PATEROS, WA

Here, the 'site_no' is ordered based on the integer values of 'ID'. To correct this, we can either apply the function on each 'ID' one at a time with rowwise

Gauges %>% 
    rowwise() %>% 
    mutate(Station = readNWISsite(ID)$station_nm)

or map from purrr

library(purrr)
Gauges %>%
    mutate(Station = map_chr(ID, ~ readNWISsite(.x)$station_nm))
# A tibble: 7 x 3
#  Name         ID       Station                                      
#  <chr>        <chr>    <chr>                                        
#1 Twisp        12448998 TWISP RIVER NEAR TWISP, WA                   
#2 Chewuch      12448000 CHEWUCH RIVER AT WINTHROP, WA                
#3 Andrews      12447390 ANDREWS CREEK NEAR MAZAMA, WA                
#4 Met@Winthrop 12448500 METHOW RIVER AT WINTHROP, WA                 
#5 Met@Twisp    12449500 METHOW RIVER AT TWISP, WA                    
#6 Met@Pateros  12449950 METHOW RIVER NEAR PATEROS, WA                
#7 Met@Goat     12447383 METHOW RIVER ABOVE GOAT CREEK NEAR MAZAMA, WA

Or we extract both columns and do a match with the 'ID' and 'site_no'

Gauges %>% 
          mutate(Station = {
           tmp <- readNWISsite(ID)[c('site_no', 'station_nm')]
              tmp$station_nm[match(ID, tmp$site_no)]})
# A tibble: 7 x 3
#  Name         ID       Station                                      
#  <chr>        <chr>    <chr>                                        
#1 Twisp        12448998 TWISP RIVER NEAR TWISP, WA                   
#2 Chewuch      12448000 CHEWUCH RIVER AT WINTHROP, WA                
#3 Andrews      12447390 ANDREWS CREEK NEAR MAZAMA, WA                
#4 Met@Winthrop 12448500 METHOW RIVER AT WINTHROP, WA                 
#5 Met@Twisp    12449500 METHOW RIVER AT TWISP, WA                    
#6 Met@Pateros  12449950 METHOW RIVER NEAR PATEROS, WA                
#7 Met@Goat     12447383 METHOW RIVER ABOVE GOAT CREEK NEAR MAZAMA, WA

Upvotes: 4

Related Questions