order() function appears to not be working -- may be due to variable type

Question

I've been running into this issue where order() appears to be working incorrectly. At this point, I think it's due to an issue with the data type. Similar results have appeared even when using ORDER BY in SQL. Please advise:

# read data from file
data <- read.csv("data/the_data.csv",
                colClasses = "character")

# create a new data frame with rate converted to numeric
temp <- cbind(data$State, data$Hospital.Name,
    as.numeric(
      data$
      Hospital.30.Day.Death..Mortality..Rates.from.Heart.Failure
      ))

# add column names to the new data frame
colnames(temp) <- c("state","hospital","rate")

# remove any cases that include NA values
d <- data.frame(temp[complete.cases(temp),])

# reduce to cases that are restricted to Alabama
d <- d[d$state == "AL",]

# order the dataframe by rate, break any ties using
# the alphabetical order of the hospital name
d <- d[order(d$rate,d$hospital),]

Here's my output:

state                                          hospital rate
21    AL                       ANDALUSIA REGIONAL HOSPITAL 10.1
14    AL                     JACKSON HOSPITAL & CLINIC INC 10.2
81    AL                      BIRMINGHAM VA MEDICAL CENTER 10.4
42    AL                         FLORALA MEMORIAL HOSPITAL 10.4
...
30    AL                         MEDICAL CENTER ENTERPRISE 12.9
61    AL                            TRINITY MEDICAL CENTER 12.9
69    AL                            MONROE COUNTY HOSPITAL   13
31    AL                                ST VINCENTS BLOUNT   13
...
8     AL                    DEKALB REGIONAL MEDICAL CENTER 16.6
15    AL                GEORGE H. LANIER MEMORIAL HOSPITAL  8.8
79    AL                          EVERGREEN MEDICAL CENTER  9.1
80    AL                       BAPTIST MEDICAL CENTER EAST  9.6
38    AL                           LAWRENCE MEDICAL CENTER  9.9

I've tried the same ordering request in 'data.table', 'dplyr', and 'sqldf'. They've all yielded this similar result. The ordering starts at about 10, goes up to about 16 and then it decides that 8.8 is less than 16.6 and starts over.

Could you please tell me why this is happening?

EDIT: providing more information about the data

dput(droplevels(head(d,20))

Results in this:

structure(list(state = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "AL", class = "factor"), 
hospital = structure(c(1L, 10L, 19L, 4L, 7L, 14L, 3L, 12L, 
15L, 20L, 5L, 8L, 11L, 13L, 6L, 18L, 17L, 9L, 2L, 16L), .Label = c("ANDALUSIA REGIONAL    HOSPITAL", 
"ATMORE COMMUNITY HOSPITAL", "BIRMINGHAM VA MEDICAL CENTER", 
"FLORALA MEMORIAL HOSPITAL", "GADSDEN REGIONAL MEDICAL CENTER", 
"GEORGIANA HOSPITAL", "GROVE HILL MEMORIAL HOSPITAL", "HALE COUNTY HOSPITAL", 
"JACK HUGHSTON MEMORIAL HOSPITAL", "JACKSON HOSPITAL & CLINIC INC", 
"MOBILE INFIRMARY", "PARKWAY MEDICAL CENTER", "RIVERVIEW REGIONAL MEDICAL CENTER", 
"SPRINGHILL MEDICAL CENTER", "ST VINCENT'S BIRMINGHAM", "ST VINCENT'S EAST", 
"ST VINCENT'S ST CLAIR", "WALKER BAPTIST MEDICAL CENTER", 
"WEDOWEE HOSPITAL", "WIREGRASS MEDICAL CENTER"), class = "factor"), 
rate = structure(c(1L, 2L, 3L, 3L, 3L, 3L, 3L, 4L, 5L, 5L, 
6L, 6L, 6L, 7L, 7L, 7L, 7L, 7L, 7L, 8L), .Label = c("10.1", 
"10.2", "10.4", "10.5", "10.6", "10.7", "10.8", "10.9"), class = "factor")), .Names =     c("state", 
"hospital", "rate"), row.names = c(21L, 14L, 17L, 42L, 53L, 77L, 
81L, 34L, 36L, 40L, 24L, 55L, 66L, 28L, 29L, 51L, 74L, 87L, 88L, 
7L), class = "data.frame")

When I read in the data using data.table 'rate' is still a factor instead of numeric:

data <- read.table("data/outcome-of-care-measures.csv")
str(d)

Results:

    'data.frame':   90 obs. of  3 variables:
 $ state   : Factor w/ 54 levels "AK","AL","AR",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ hospital: Factor w/ 3775 levels "ABBEVILLE AREA MEDICAL CENTER",..: 74 1435 3640 971 1150 3033 292 2418 3212 3742 ...
 $ rate    : Factor w/ 105 levels "10","10.1","10.2",..: 2 3 5 5 5 5 5 6 7 7 ...

Referenced this stackoverflow post. Tried this:

data <- read.csv("data/outcome-of-care-measures.csv", colClasses = "character")
f <- data$Hospital.30.Day.Death..Mortality..Rates.from.Heart.Failure
summary(f)

Length     Class      Mode 
 4706 character character

f <- as.numeric(levels(f))[f]
summary(f)

Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
 NA      NA      NA     NaN      NA      NA    4706

The conclusion that I'm reaching is that the Factor variable cannot be converted to numeric. Therefore, it cannot be ordered. Please let me know if you think otherwise. I'm partial to General Ackbar on this one, "it's a trap!"

akrun · Accepted Answer

You can convert the d$rate to numeric column

d$rate <- as.numeric(as.character(d$rate)
d1 <- d[order(d$rate, d$hospital),]

I would suspect this occured as a result of using colClasses=character in the read.csv(not tested). You could have used colClasses=c('character', 'character', 'numeric')

For example, if I read the example data using read.table

 d <- read.table('the_data.csv', colClasses='character')
 str(d)
 #'data.frame': 13 obs. of  3 variables:
 #$ state   : chr  "AL" "AL" "AL" "AL" ...
 #$ hospital: chr  "ANDALUSIA REGIONAL HOSPITAL" "JACKSON HOSPITAL & CLINIC INC" "BIRMINGHAM VA MEDICAL CENTER" "FLORALA MEMORIAL HOSPITAL" ...
# $ rate    : chr  "10.1" "10.2" "10.4" "10.4" ...

Even without specifying the colClasses, this reads correctly. If you don't want factor columns, you can use stringsAsFactors=FALSE in the read.table

 d <- read.table('the_data.csv')
 str(d)
 #'data.frame': 13 obs. of  3 variables:
 #$ state   : Factor w/ 1 level "AL": 1 1 1 1 1 1 1 1 1 1 ...
 #$ hospital: Factor w/ 13 levels "ANDALUSIA REGIONAL HOSPITAL",..: 1 8 3 6 10 13 11 12 4 7 ...
 #$ rate    : num  10.1 10.2 10.4 10.4 12.9 12.9 13 13 16.6 8.8 ...

 d[order(d$rate, d$hospital),]$rate
 #[1]  8.8  9.1  9.6  9.9 10.1 10.2 10.4 10.4 12.9 12.9 13.0 13.0 16.6

Update

Using the dput dataset

 d$rate <- as.numeric(as.character(d$rate))
 str(d)
 #'data.frame': 20 obs. of  3 variables:
 #$ state   : Factor w/ 1 level "AL": 1 1 1 1 1 1 1 1 1 1 ...
 #$ hospital: Factor w/ 20 levels "ANDALUSIA REGIONAL    HOSPITAL",..: 1 10 19 4 7 14 3 12 15 20 ...
 #$ rate    : num  10.1 10.2 10.4 10.4 10.4 10.4 10.4 10.5 10.6 10.6 ...

 d[order(d$rate, d$hospital),]$rate
 #[1] 10.1 10.2 10.4 10.4 10.4 10.4 10.4 10.5 10.6 10.6 10.7 10.7 10.7 10.8 10.8
 #[16] 10.8 10.8 10.8 10.8 10.9

order() function appears to not be working -- may be due to variable type

Answers (1)

Update

Related Questions