Reputation: 1050
I've been running into this issue where order() appears to be working incorrectly. At this point, I think it's due to an issue with the data type. Similar results have appeared even when using ORDER BY in SQL. Please advise:
# read data from file
data <- read.csv("data/the_data.csv",
colClasses = "character")
# create a new data frame with rate converted to numeric
temp <- cbind(data$State, data$Hospital.Name,
as.numeric(
data$
Hospital.30.Day.Death..Mortality..Rates.from.Heart.Failure
))
# add column names to the new data frame
colnames(temp) <- c("state","hospital","rate")
# remove any cases that include NA values
d <- data.frame(temp[complete.cases(temp),])
# reduce to cases that are restricted to Alabama
d <- d[d$state == "AL",]
# order the dataframe by rate, break any ties using
# the alphabetical order of the hospital name
d <- d[order(d$rate,d$hospital),]
Here's my output:
state hospital rate
21 AL ANDALUSIA REGIONAL HOSPITAL 10.1
14 AL JACKSON HOSPITAL & CLINIC INC 10.2
81 AL BIRMINGHAM VA MEDICAL CENTER 10.4
42 AL FLORALA MEMORIAL HOSPITAL 10.4
...
30 AL MEDICAL CENTER ENTERPRISE 12.9
61 AL TRINITY MEDICAL CENTER 12.9
69 AL MONROE COUNTY HOSPITAL 13
31 AL ST VINCENTS BLOUNT 13
...
8 AL DEKALB REGIONAL MEDICAL CENTER 16.6
15 AL GEORGE H. LANIER MEMORIAL HOSPITAL 8.8
79 AL EVERGREEN MEDICAL CENTER 9.1
80 AL BAPTIST MEDICAL CENTER EAST 9.6
38 AL LAWRENCE MEDICAL CENTER 9.9
I've tried the same ordering request in 'data.table', 'dplyr', and 'sqldf'. They've all yielded this similar result. The ordering starts at about 10, goes up to about 16 and then it decides that 8.8 is less than 16.6 and starts over.
Could you please tell me why this is happening?
EDIT: providing more information about the data
dput(droplevels(head(d,20))
Results in this:
structure(list(state = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "AL", class = "factor"),
hospital = structure(c(1L, 10L, 19L, 4L, 7L, 14L, 3L, 12L,
15L, 20L, 5L, 8L, 11L, 13L, 6L, 18L, 17L, 9L, 2L, 16L), .Label = c("ANDALUSIA REGIONAL HOSPITAL",
"ATMORE COMMUNITY HOSPITAL", "BIRMINGHAM VA MEDICAL CENTER",
"FLORALA MEMORIAL HOSPITAL", "GADSDEN REGIONAL MEDICAL CENTER",
"GEORGIANA HOSPITAL", "GROVE HILL MEMORIAL HOSPITAL", "HALE COUNTY HOSPITAL",
"JACK HUGHSTON MEMORIAL HOSPITAL", "JACKSON HOSPITAL & CLINIC INC",
"MOBILE INFIRMARY", "PARKWAY MEDICAL CENTER", "RIVERVIEW REGIONAL MEDICAL CENTER",
"SPRINGHILL MEDICAL CENTER", "ST VINCENT'S BIRMINGHAM", "ST VINCENT'S EAST",
"ST VINCENT'S ST CLAIR", "WALKER BAPTIST MEDICAL CENTER",
"WEDOWEE HOSPITAL", "WIREGRASS MEDICAL CENTER"), class = "factor"),
rate = structure(c(1L, 2L, 3L, 3L, 3L, 3L, 3L, 4L, 5L, 5L,
6L, 6L, 6L, 7L, 7L, 7L, 7L, 7L, 7L, 8L), .Label = c("10.1",
"10.2", "10.4", "10.5", "10.6", "10.7", "10.8", "10.9"), class = "factor")), .Names = c("state",
"hospital", "rate"), row.names = c(21L, 14L, 17L, 42L, 53L, 77L,
81L, 34L, 36L, 40L, 24L, 55L, 66L, 28L, 29L, 51L, 74L, 87L, 88L,
7L), class = "data.frame")
When I read in the data using data.table 'rate' is still a factor instead of numeric:
data <- read.table("data/outcome-of-care-measures.csv")
str(d)
Results:
'data.frame': 90 obs. of 3 variables:
$ state : Factor w/ 54 levels "AK","AL","AR",..: 2 2 2 2 2 2 2 2 2 2 ...
$ hospital: Factor w/ 3775 levels "ABBEVILLE AREA MEDICAL CENTER",..: 74 1435 3640 971 1150 3033 292 2418 3212 3742 ...
$ rate : Factor w/ 105 levels "10","10.1","10.2",..: 2 3 5 5 5 5 5 6 7 7 ...
Referenced this stackoverflow post. Tried this:
data <- read.csv("data/outcome-of-care-measures.csv", colClasses = "character")
f <- data$Hospital.30.Day.Death..Mortality..Rates.from.Heart.Failure
summary(f)
Length Class Mode
4706 character character
f <- as.numeric(levels(f))[f]
summary(f)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
NA NA NA NaN NA NA 4706
The conclusion that I'm reaching is that the Factor variable cannot be converted to numeric. Therefore, it cannot be ordered. Please let me know if you think otherwise. I'm partial to General Ackbar on this one, "it's a trap!"
Upvotes: 1
Views: 11915
Reputation: 887511
You can convert the d$rate
to numeric
column
d$rate <- as.numeric(as.character(d$rate)
d1 <- d[order(d$rate, d$hospital),]
I would suspect this occured as a result of using colClasses=character
in the read.csv
(not tested). You could have used colClasses=c('character', 'character', 'numeric')
For example, if I read the example data
using read.table
d <- read.table('the_data.csv', colClasses='character')
str(d)
#'data.frame': 13 obs. of 3 variables:
#$ state : chr "AL" "AL" "AL" "AL" ...
#$ hospital: chr "ANDALUSIA REGIONAL HOSPITAL" "JACKSON HOSPITAL & CLINIC INC" "BIRMINGHAM VA MEDICAL CENTER" "FLORALA MEMORIAL HOSPITAL" ...
# $ rate : chr "10.1" "10.2" "10.4" "10.4" ...
Even without specifying the colClasses
, this reads correctly. If you don't want factor
columns, you can use stringsAsFactors=FALSE
in the read.table
d <- read.table('the_data.csv')
str(d)
#'data.frame': 13 obs. of 3 variables:
#$ state : Factor w/ 1 level "AL": 1 1 1 1 1 1 1 1 1 1 ...
#$ hospital: Factor w/ 13 levels "ANDALUSIA REGIONAL HOSPITAL",..: 1 8 3 6 10 13 11 12 4 7 ...
#$ rate : num 10.1 10.2 10.4 10.4 12.9 12.9 13 13 16.6 8.8 ...
d[order(d$rate, d$hospital),]$rate
#[1] 8.8 9.1 9.6 9.9 10.1 10.2 10.4 10.4 12.9 12.9 13.0 13.0 16.6
Using the dput
dataset
d$rate <- as.numeric(as.character(d$rate))
str(d)
#'data.frame': 20 obs. of 3 variables:
#$ state : Factor w/ 1 level "AL": 1 1 1 1 1 1 1 1 1 1 ...
#$ hospital: Factor w/ 20 levels "ANDALUSIA REGIONAL HOSPITAL",..: 1 10 19 4 7 14 3 12 15 20 ...
#$ rate : num 10.1 10.2 10.4 10.4 10.4 10.4 10.4 10.5 10.6 10.6 ...
d[order(d$rate, d$hospital),]$rate
#[1] 10.1 10.2 10.4 10.4 10.4 10.4 10.4 10.5 10.6 10.6 10.7 10.7 10.7 10.8 10.8
#[16] 10.8 10.8 10.8 10.8 10.9
Upvotes: 3