sevenSeat

Reputation: 1

Performing calculations by level in R

I'm trying to perform calculations based off a certain level in my data frame.

  YEAR MONTH CARRIER ORIGIN DEST DEP_DELAY ARR_DELAY CANCELLED
1 2014     1      AA    JFK  LAX        14        13         0
2 2014     1      AA    JFK  LAX        -3         1         0
3 2014     1      AA    JFK  LAX        NA        NA         1
4 2014     1      AA    JFK  LAX        65        59         0
5 2014     1      AA    JFK  LAX       110       110         0
6 2014     1      AA    JFK  LAX        17        -8         0
7 2014     1      AA    JFK  LAX        10       -13         0

For example, I want to group by $CARRIER and find out how many times each carrier had a flight delay. I also want to calculate other things, like mean arrival delays, etc. Can anyone show me how to perform calculations by level in R? Thanks! Hannah

Upvotes: 0

Views: 122

Answers (2)

Richard Erickson

Reputation: 2616

There are several different ways to do this. Depending upon your ultimate goals, different approaches offer different advantages.

Here's a comparison of three approaches, using a reproducible example with dummy data:

## Create data
d <- data.frame(CARRIER   = as.factor(c("a", "b",  "a", "c", "b", "a", "c")),
                DEP_DELAY = as.factor(c("Y",  "N", "N", NA, "Y", "N" ,  "Y")),
                ARR_DELAY = as.factor(c("N",  "N", "Y", "N", "Y", "N", "Y")),
                CANCELLED = as.factor(c("N",  "N", "N", "N", NA, "Y", "Y")))                

1) The aggregate function in base R is perhaps the simplest way to get what you want, and I would recommend using it if this is all you want to do:

aggregate(DEP_DELAY ~ CARRIER, d, summary)
#   CARRIER DEP_DELAY.N DEP_DELAY.Y
# 1       a           2           1
# 2       b           1           1
# 3       c           0           1
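
The same aggregate call also works on numeric delay columns like the ones in the question. Here is a minimal sketch; the flights data frame and its values are made up for illustration, and "delayed" is assumed to mean DEP_DELAY > 0:

```r
# Hypothetical numeric data shaped like the question's table (values are made up)
flights <- data.frame(
  CARRIER   = c("AA", "AA", "AA", "AA", "DL", "DL", "UA"),
  DEP_DELAY = c(14, -3, NA, 65, 110, 0, 17),
  ARR_DELAY = c(13, 1, NA, 59, 110, -8, -13)
)

# Mean arrival delay per carrier; the formula interface drops NA rows by default
mean_arr <- aggregate(ARR_DELAY ~ CARRIER, data = flights, FUN = mean)

# Number of flights per carrier that departed late (assumption: late means > 0)
n_delayed <- aggregate(DEP_DELAY ~ CARRIER, data = flights,
                       FUN = function(x) sum(x > 0))

print(mean_arr)
print(n_delayed)
```

Swapping FUN for median, max, or any other function that returns a single value gives the corresponding per-carrier summary.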

2) The plyr package uses a different syntax than base R, but is very powerful. It was written by Hadley Wickham, who also wrote the ggplot2 plotting package. The benefits of plyr are its expressive syntax (base R can become clunky when you start to do complicated summaries) and its usefulness in manipulating data for ggplot2 (because Wickham wrote both of them and they complement each other nicely).

library(plyr) # you will need to install this package
ddply(d, .(CARRIER, DEP_DELAY), summary)
#   CARRIER DEP_DELAY ARR_DELAY CANCELLED
#1    a:2       N:2       N:1       N:1  
#2    b:0       Y:0       Y:1       Y:1  
#3    c:0        <NA>      <NA>      <NA>
#4    a:1       N:0       N:1       N:1  
# I clipped the output to save space 
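
For numeric columns like those in the question, ddply can be combined with plyr's summarise to compute several statistics per carrier at once. This is a sketch only; the flights data frame is hypothetical and "delayed" is assumed to mean DEP_DELAY > 0:

```r
library(plyr)  # assumes the plyr package is installed

# Hypothetical numeric data shaped like the question's table (values are made up)
flights <- data.frame(
  CARRIER   = c("AA", "AA", "AA", "AA", "DL", "DL", "UA"),
  DEP_DELAY = c(14, -3, NA, 65, 110, 0, 17),
  ARR_DELAY = c(13, 1, NA, 59, 110, -8, -13)
)

# One output row per carrier, one column per summary
by_carrier <- ddply(flights, .(CARRIER), summarise,
                    n_delayed = sum(DEP_DELAY > 0, na.rm = TRUE),
                    mean_arr  = mean(ARR_DELAY, na.rm = TRUE))
print(by_carrier)
```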

3) The data.table package uses a third syntax. Like plyr, it is a powerful library with its own syntax. Its benefit over plyr is that it can handle much larger data sets because it uses memory more efficiently.

library(data.table) # You'll also need to install this package
DT = data.table(d) # Convert data.frame to data.table
DT[,summary(DEP_DELAY), by = CARRIER]
#   CARRIER V1
#1:       a  2
#2:       a  1
#3:       b  1
#4:       b  1
#5:       c  0
#6:       c  1
#7:       c  1
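
With numeric delay columns, data.table can return several named summaries per carrier in one call. A minimal sketch, using a hypothetical flights table and assuming "delayed" means DEP_DELAY > 0:

```r
library(data.table)  # assumes the data.table package is installed

# Hypothetical numeric data shaped like the question's table (values are made up)
flights <- data.table(
  CARRIER   = c("AA", "AA", "AA", "AA", "DL", "DL", "UA"),
  DEP_DELAY = c(14, -3, NA, 65, 110, 0, 17),
  ARR_DELAY = c(13, 1, NA, 59, 110, -8, -13)
)

# .() builds a list of named summary columns; by = CARRIER groups the rows
by_carrier <- flights[, .(n_delayed = sum(DEP_DELAY > 0, na.rm = TRUE),
                          mean_arr  = mean(ARR_DELAY, na.rm = TRUE)),
                      by = CARRIER]
print(by_carrier)
```

Naming the summaries inside .() avoids the generic V1 column seen in the output above.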

If you're just learning R, I would suggest method 1. If you use R more, I suggest learning both plyr and data.table, because each can be advantageous to have in your toolbox. If you're using larger data sets (hundreds of MB or larger), I would learn data.table first. If you want to learn ggplot2, I would learn plyr first.

Upvotes: 2

CephBirk

Reputation: 6710

You should use the tapply() function.

For example let's say your data.frame is called data. Then you can use:

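tapply(data$DEP_DELAY, data$CARRIER, function(x) length(na.omit(x))) to count the non-missing departure delay values for each carrier. To count only the flights that actually left late, use function(x) sum(x > 0, na.rm = TRUE) instead.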

tapply(data$ARR_DELAY, data$CARRIER, mean, na.rm=TRUE) to find out mean arrival delays.
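
As a runnable illustration of the tapply calls, here is a small sketch on made-up data. The flights data frame and its values are hypothetical, and a flight is assumed to count as delayed when DEP_DELAY > 0:

```r
# Hypothetical numeric data shaped like the question's table (values are made up)
flights <- data.frame(
  CARRIER   = c("AA", "AA", "AA", "AA", "DL", "DL", "UA"),
  DEP_DELAY = c(14, -3, NA, 65, 110, 0, 17),
  ARR_DELAY = c(13, 1, NA, 59, 110, -8, -13)
)

# Count of late departures per carrier (NA values are ignored)
n_delayed <- tapply(flights$DEP_DELAY, flights$CARRIER,
                    function(x) sum(x > 0, na.rm = TRUE))

# Mean arrival delay per carrier; na.rm = TRUE is passed through to mean()
mean_arr <- tapply(flights$ARR_DELAY, flights$CARRIER, mean, na.rm = TRUE)

print(n_delayed)
print(mean_arr)
```

Both results are arrays named by carrier, so a single value can be pulled out with, for example, mean_arr[["DL"]].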

Upvotes: 0
