Reputation: 449
I have a data frame in the format below. This is sample data; the real data has thousands of records in the same format:
ORIGIN DEST CARRIER_DELAY WEATHER_DELAY NAS_DELAY
JFK MCO 1 0 47
JFK LAX
JFK MCO 1 2 30
LOG DFW 12 20 3
LOG DFW
I need to group by origin and destination and count the occurrences of each delay type using dplyr. The values in the delay columns are in minutes. For each column, every value greater than 0 should increase the count by 1. Some rows have null values, and those should be ignored.
The output should look like below:
ORIGIN DEST CARR_DELAY_COUNT WEATHER_DELAY_COUNT NAS_DELAY_COUNT
JFK MCO 2 1 2
LOG DFW 1 1 1
I am using the dplyr code below:
flight.df %>%
  group_by(ORIGIN, DEST) %>%
  summarize(carr_delay = sum(CARRIER_DELAY, na.rm = TRUE),
            weather_delay = sum(WEATHER_DELAY, na.rm = TRUE),
            nas_delay = sum(NAS_DELAY, na.rm = TRUE)) %>%
  group_by() %>%
  {.} -> delays.df
This gives the sum of each delay category, in minutes, for every origin and destination pair.
How can I add the count of each delay alongside the sum?
Upvotes: 0
Views: 513
Reputation: 887961
We can use data.table
library(data.table)
setDT(df)[, lapply(.SD, function(x) sum(x > 0, na.rm=TRUE)) , .(ORIGIN, DEST)]
# ORIGIN DEST CARRIER_DELAY WEATHER_DELAY NAS_DELAY
#1: JFK MCO 2 1 2
#2: JFK LAX 0 0 0
#3: LOG DFW 1 1 1
NOTE: This straightforward method gives the same correct output as the accepted answer.
Upvotes: 0
Reputation: 38520
It is also straightforward to calculate this using the base R function aggregate.
aggregate(cbind("CARRIER_DELAY"=CARRIER_DELAY,
"WEATHER_DELAY"=WEATHER_DELAY,
"NAS_DELAY"=NAS_DELAY) ~ ORIGIN + DEST,
data=df, FUN=function(x) sum(x > 0, na.rm=TRUE))
which returns
ORIGIN DEST CARRIER_DELAY WEATHER_DELAY NAS_DELAY
1 LOG DFW 1 1 1
2 JFK MCO 2 1 2
I use cbind to group the summary variables together and also to give names to the output columns.
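The JFK/LAX pair is missing from the output above because the formula method of aggregate drops rows containing NA by default (na.action = na.omit). If that pair should appear with zero counts, as in the other answers, a minimal sketch is to pass na.action = na.pass. The data frame here is rebuilt from the question's sample, with NA assumed for the blank cells.

```r
# Sample data from the question; NA is assumed for the blank delay cells
df <- data.frame(
  ORIGIN = c("JFK", "JFK", "JFK", "LOG", "LOG"),
  DEST = c("MCO", "LAX", "MCO", "DFW", "DFW"),
  CARRIER_DELAY = c(1, NA, 1, 12, NA),
  WEATHER_DELAY = c(0, NA, 2, 20, NA),
  NAS_DELAY = c(47, NA, 30, 3, NA)
)

# na.pass keeps the NA rows in the model frame, so every ORIGIN/DEST pair
# survives; na.rm = TRUE inside FUN then ignores the NAs when counting
res <- aggregate(cbind(CARRIER_DELAY, WEATHER_DELAY, NAS_DELAY) ~ ORIGIN + DEST,
                 data = df,
                 FUN = function(x) sum(x > 0, na.rm = TRUE),
                 na.action = na.pass)
print(res)
```

With this change the result has one row per origin and destination pair, including pairs whose delay values are all missing.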
Upvotes: 0
Reputation: 215137
You can use summarize_each after a group_by using the dplyr package. You'll have to rename the columns though.
library(dplyr)
df %>% group_by(ORIGIN, DEST) %>% summarize_each(funs(Count = sum(.>0, na.rm=T)))
Source: local data frame [3 x 5]
Groups: ORIGIN [?]
ORIGIN DEST CARRIER_DELAY WEATHER_DELAY NAS_DELAY
(fctr) (fctr) (int) (int) (int)
1 JFK LAX 0 0 0
2 JFK MCO 2 1 2
3 LOG DFW 1 1 1
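In dplyr 1.0.0 and later, summarize_each and funs are deprecated, and across covers the same ground. The sketch below is one possible way to write it, and it returns the sum and the count side by side, which is what the question asks for. The data frame is rebuilt from the question's sample with NA assumed for the blank cells, and the SUM and COUNT suffixes are illustrative names, not something from the original post.

```r
library(dplyr)

# Sample data from the question; NA is assumed for the blank delay cells
df <- data.frame(
  ORIGIN = c("JFK", "JFK", "JFK", "LOG", "LOG"),
  DEST = c("MCO", "LAX", "MCO", "DFW", "DFW"),
  CARRIER_DELAY = c(1, NA, 1, 12, NA),
  WEATHER_DELAY = c(0, NA, 2, 20, NA),
  NAS_DELAY = c(47, NA, 30, 3, NA)
)

delays.df <- df %>%
  group_by(ORIGIN, DEST) %>%
  summarise(
    across(
      ends_with("_DELAY"),
      list(
        SUM = ~ sum(.x, na.rm = TRUE),        # total minutes of delay
        COUNT = ~ sum(.x > 0, na.rm = TRUE)   # number of rows with a delay > 0
      )
    ),
    .groups = "drop"                          # return an ungrouped result
  )

print(delays.df)
```

Because a named list of functions is passed, the output columns are named by column and function, for example CARRIER_DELAY_SUM and CARRIER_DELAY_COUNT, so no separate renaming step is needed.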
Upvotes: 4