Reputation: 1
I'm trying to perform calculations based off a certain level in my data frame.
  YEAR MONTH CARRIER ORIGIN DEST DEP_DELAY ARR_DELAY CANCELLED
1 2014     1      AA    JFK  LAX        14        13         0
2 2014     1      AA    JFK  LAX        -3         1         0
3 2014     1      AA    JFK  LAX        NA        NA         1
4 2014     1      AA    JFK  LAX        65        59         0
5 2014     1      AA    JFK  LAX       110       110         0
6 2014     1      AA    JFK  LAX        17        -8         0
7 2014     1      AA    JFK  LAX        10       -13         0
For example, I want to group by $CARRIER and find out how many times each carrier had a flight delay. I also want to calculate other things, like mean arrival delays, etc.
Can anyone show me how to perform calculations by level in R?
Thanks!
Hannah
Upvotes: 0
Views: 122
Reputation: 2616
There are several different ways to do this. Depending upon your ultimate goals, different approaches offer different advantages.
Here's a comparison of three approaches, using a reproducible example with dummy data (the delay columns here are Y/N factors rather than minutes):
## Create data
d <- data.frame(CARRIER   = as.factor(c("a", "b", "a", "c", "b", "a", "c")),
                DEP_DELAY = as.factor(c("Y", "N", "N", NA, "Y", "N", "Y")),
                ARR_DELAY = as.factor(c("N", "N", "Y", "N", "Y", "N", "Y")),
                CANCELLED = as.factor(c("N", "N", "N", "N", NA, "Y", "Y")))
1) The aggregate function in base R is perhaps the simplest way to get what you want, and I would recommend it if this is all you want to do:
aggregate(DEP_DELAY ~ CARRIER, d, summary)
#   CARRIER DEP_DELAY.N DEP_DELAY.Y
# 1       a           2           1
# 2       b           1           1
# 3       c           0           1
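The question's columns are numeric (minutes), so the same aggregate call can be given a counting or averaging function instead of summary. The following is a minimal sketch, not a definitive solution: the flights data frame and its values are made up for illustration, and a "delay" is assumed to mean DEP_DELAY > 0.

```r
# Made-up numeric delays in minutes (illustrative values only);
# NA marks a cancelled flight
flights <- data.frame(
  CARRIER   = c("AA", "AA", "AA", "DL", "DL", "DL"),
  DEP_DELAY = c(14, -3, NA, 65, 0, 17),
  ARR_DELAY = c(13, 1, NA, 59, -5, -8)
)

# Number of flights per carrier that departed late (assumed: DEP_DELAY > 0).
# The formula interface drops rows with NA before calling FUN.
n_delayed <- aggregate(DEP_DELAY ~ CARRIER, data = flights,
                       FUN = function(x) sum(x > 0))

# Mean arrival delay per carrier, again with NA rows dropped
mean_arr <- aggregate(ARR_DELAY ~ CARRIER, data = flights, FUN = mean)

print(n_delayed)
print(mean_arr)
```

Each call returns a data frame with one row per carrier, which makes the results easy to merge or plot later.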
2) The plyr package uses a different syntax than base R, but is very powerful. It was written by Hadley Wickham, who also wrote the ggplot2 plotting package. The benefits of plyr are its expressive syntax (base R can become clunky when you start to do complicated summaries) and its usefulness in manipulating data for ggplot2 (because Wickham wrote both of them and they complement each other nicely).
library(plyr) # you will need to install this package
ddply(d, .(CARRIER, DEP_DELAY), summary)
# CARRIER DEP_DELAY ARR_DELAY CANCELLED
#1 a:2 N:2 N:1 N:1
#2 b:0 Y:0 Y:1 Y:1
#3 c:0 <NA> <NA> <NA>
#4 a:1 N:0 N:1 N:1
# I clipped the output to save space
3) The data.table package uses a third syntax. Like plyr, it is a powerful library. Its benefit over plyr is that it can handle much larger data sets, because it manages memory differently.
library(data.table) # You'll also need to install this package
DT = data.table(d) # Convert data.frame to data.table
DT[,summary(DEP_DELAY), by = CARRIER]
# CARRIER V1
#1: a 2
#2: a 1
#3: b 1
#4: b 1
#5: c 0
#6: c 1
#7: c 1
If you're just learning R, I would suggest method 1. If you use R more, I suggest learning both plyr and data.table, because each can be advantageous to have in your toolbox. If you're using larger data sets (hundreds of MB or more), I would learn data.table first. If you want to learn ggplot2, I would learn plyr first.
Upvotes: 2
Reputation: 6710
You should use the tapply() function.
For example, let's say your data.frame is called data. Then you can use:
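tapply(data$DEP_DELAY, data$CARRIER, function(x) sum(x > 0, na.rm = TRUE))
to find how many times each carrier had a flight delay (counting a positive DEP_DELAY as a delay; length(na.omit(x)) would instead count every flight with a recorded value, including early departures).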
tapply(data$ARR_DELAY, data$CARRIER, mean, na.rm=TRUE)
to find out mean arrival delays.
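As a quick check, here is a minimal sketch that rebuilds the seven sample rows from the question and applies tapply to them. It assumes a delay means DEP_DELAY > 0; that definition is not stated in the question, so adjust the condition if yours differs.

```r
# The seven sample rows from the question (only the columns used here)
data <- data.frame(
  CARRIER   = rep("AA", 7),
  DEP_DELAY = c(14, -3, NA, 65, 110, 17, 10),
  ARR_DELAY = c(13, 1, NA, 59, 110, -8, -13),
  CANCELLED = c(0, 0, 1, 0, 0, 0, 0)
)

# Flights per carrier that departed late (assumed: DEP_DELAY > 0)
delayed <- tapply(data$DEP_DELAY, data$CARRIER,
                  function(x) sum(x > 0, na.rm = TRUE))

# Mean arrival delay per carrier, ignoring the cancelled flight's NA
mean_arr <- tapply(data$ARR_DELAY, data$CARRIER, mean, na.rm = TRUE)

# Number of cancelled flights per carrier
cancelled <- tapply(data$CANCELLED, data$CARRIER, sum)

print(delayed)
print(mean_arr)
print(cancelled)
```

tapply returns a named array with one element per carrier, so with more carriers in the data each would get its own entry.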
Upvotes: 0