user3299824

Reputation: 43

Why is arrange() ignoring values? (Not used with group by)

My question concerns the arrange function from the dplyr package. I saw some posts about it, but all of them involved issues with group_by and arrange together, whereas arrange on its own seems to be causing my problem. It only sorts some columns of my data correctly.

I don't know if you'll be able to reproduce my problem without the data, so here is a link to it. It's the file called outcome-of-care-measures.csv, a data frame of hospitals and other health-related variables. I wrote a function, best, that is supposed to return the hospital with the lowest 30-day mortality rate in a given state for one of three health conditions.

I read the data and assign names for the relevant columns like this:

best<-function(ST, outcome){
  library(dplyr)
  data<-read.csv("outcome-of-care-measures.csv", na.strings = "Not available", stringsAsFactors = FALSE)
  outcomes<-c("heart attack"=11, "heart failure"=17, "pneumonia"=23) 

I then have three if branches, each of which finds the hospital with the lowest mortality rate for the input health condition. My first branch works just fine, and I can't tell it apart from the one that doesn't. The branch below returns data whose outcome column is incorrectly sorted.

  if (outcome=="pneumonia"){
    rel_data<-data[, c(2,7,outcomes["pneumonia"])]
    names(rel_data)<-c("hospital", "state", "outcome")
    sorted<- arrange(rel_data, state, outcome, hospital)
    state_sorted<-subset(sorted, state==ST)
    print(state_sorted$hospital[1])}}

When I call best("MD", "pneumonia") it returns the 10th-ranked hospital, not the first. It looks like indices 1-9 were cut off the top of this column and pasted at the bottom. Any idea what might be going wrong? If I type in "heart attack" in place of "pneumonia", the column seems to be sorted just fine and I get the correct output. I'm 100% sure the only difference is "pneumonia" instead of "heart attack".

Upvotes: 1

Views: 142

Answers (1)

Stuart Allen

Reputation: 1592

Here's a function that does what you're after, using the tidyverse package ecosystem.

getBestHospital <- function(data, state, outcome) {

  # column numbers for health conditions
  outcomes <- c("heart attack" = 11, "heart failure" = 17, "pneumonia" = 23)

  # get name of column to sort by
  sortCol <- colnames(data)[outcomes[outcome]]

  # return top-ranked hospital for given state and outcome
  # (arrange_() is deprecated; .data[[sortCol]] sorts by a column named in a string)
  data %>%
    dplyr::filter(State == state) %>%
    dplyr::arrange(.data[[sortCol]]) %>%
    dplyr::pull(`Hospital Name`) %>%
    head(1)

}

And here's how to call it:

library(tidyverse)

d <- readr::read_csv("~/../Downloads/outcome-of-care-measures.csv", na = "Not Available")

getBestHospital(d, "MD", "pneumonia")

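Note that using na = "Not Available" solves the issue of having non-numeric data in the outcome columns. The read.csv call in the question uses na.strings = "Not available" (lower-case "a"), which does not match those entries, so the outcome column is most likely being read as character and sorted as text rather than as numbers.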
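The effect of text sorting is easy to see on a toy vector. This is a minimal sketch with made-up rates, not values from the file: when the column is character, values are compared as strings, so anything starting with "1" sorts ahead of "7.4" or "9.8", which would match the symptom of ranks 1-9 ending up at the bottom.

```r
# Made-up mortality rates; one entry uses the file's missing-value marker
rates_chr <- c("9.8", "10.1", "7.4", "12.3", "Not Available")

# String comparison: "10.1" and "12.3" come before "7.4" and "9.8"
sorted_chr <- sort(rates_chr)

# Treat the marker as NA, convert to numeric, then sort as numbers
rates_num <- as.numeric(replace(rates_chr, rates_chr == "Not Available", NA))
sorted_num <- sort(rates_num)  # sort() drops NA by default

print(sorted_chr)
print(sorted_num)
```

A quick check on the real data is class(data[[23]]): if it reports "character", the missing-value marker was not recognised when the file was read.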

Some sample output:

> getBestHospital(d, "MD", "pneumonia")
[1] "GREATER BALTIMORE MEDICAL CENTER"
> getBestHospital(d, "CA", "heart attack")
[1] "GLENDALE ADVENTIST MEDICAL CENTER"
> getBestHospital(d, "FL", "heart failure")
[1] "FLORIDA HOSPITAL HEARTLAND MEDICAL CENTER"
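The arrange() call in the question also sorted by hospital name, which breaks ties alphabetically, whereas getBestHospital sorts only by the outcome column. If that tie-break matters, here is a minimal base-R sketch of the same lookup. The column names hospital, state and outcome follow the question; the rows and the helper name best_in_state are invented for illustration.

```r
# Invented rows for illustration; not taken from the CSV
toy <- data.frame(
  hospital = c("HOSPITAL B", "HOSPITAL A", "HOSPITAL C", "HOSPITAL D"),
  state    = c("MD", "MD", "MD", "CA"),
  outcome  = c("9.1", "9.1", "10.4", "Not Available"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: best hospital in a state, ties broken by name
best_in_state <- function(df, st) {
  # Make the outcome numeric; the missing-value marker becomes NA
  df$outcome <- suppressWarnings(as.numeric(df$outcome))
  # Keep only hospitals in the requested state with a known rate
  in_state <- df[df$state == st & !is.na(df$outcome), ]
  # Lowest rate first, hospital name as tie-break
  in_state <- in_state[order(in_state$outcome, in_state$hospital), ]
  in_state$hospital[1]
}

print(best_in_state(toy, "MD"))
```

In this toy data the two "MD" hospitals tied at 9.1 are ordered by name, so "HOSPITAL A" is returned. A state with no usable rates gives NA.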

Upvotes: 1
