Reputation: 13
I am new to R and this is my first time using Stack Overflow, so excuse me if I ask something obvious or if my question is not clear enough.
I am working with the following data set:
dim(storm)
[1] 883602 39
names(storm)
[1] "STATE__" "BGN_DATE" "BGN_TIME" "TIME_ZONE" "COUNTY"
[6] "COUNTYNAME" "STATE" "EVTYPE" "BGN_RANGE" "BGN_AZI"
[11] "BGN_LOCATI" "END_DATE" "END_TIME" "COUNTY_END" "COUNTYENDN"
[16] "END_RANGE" "END_AZI" "END_LOCATI" "LENGTH" "WIDTH"
[21] "F" "MAG" "FATALITIES" "INJURIES" "PROPDMG"
[26] "PROPDMGEXP" "CROPDMG" "CROPDMGEXP" "WFO" "STATEOFFIC"
[31] "ZONENAMES" "LATITUDE" "LONGITUDE" "LATITUDE_E" "LONGITUDE_"
[36] "REMARKS" "REFNUM" "PROPTOTAL" "CROPTOTAL"
I want to use EVTYPE (a factor variable) to aggregate four other numeric variables (PROPTOTAL, CROPTOTAL, FATALITIES, INJURIES).
The factor variable has 950 levels:
length(unique(storm$EVTYPE))
[1] 950
class(storm$EVTYPE)
[1] "factor"
So I would expect an aggregated data frame with 950 observations and 5 variables when I run the following command:
storm_tidy <- aggregate(cbind(PROPTOTAL, CROPTOTAL, FATALITIES, INJURIES) ~ EVTYPE,
                        FUN = sum, data = storm)
However, I get only 155 rows:
dim(storm_tidy)
[1] 155 5
I am using aggregate with several columns, following the example on the function's help page (which uses cbind):
Formulas, one ~ one, one ~ many, many ~ one, and many ~ many:
aggregate(weight ~ feed, data = chickwts, mean)
aggregate(breaks ~ wool + tension, data = warpbreaks, mean)
**aggregate(cbind(Ozone, Temp) ~ Month, data = airquality, mean)**
aggregate(cbind(ncases, ncontrols) ~ alcgp + tobgp, data = esoph, sum)
I am losing information at some point:
sum(storm$PROPTOTAL)
[1] 424769204805
sum(storm_tidy$PROPTOTAL)
[1] 228366211339
However, if I aggregate column by column it seems to work fine:
storm_tidy <- aggregate(PROPTOTAL~EVTYPE,FUN = sum, data = storm)
dim(storm_tidy)
[1] 950 2
sum(storm_tidy$PROPTOTAL)
[1] 424769204805
What am I missing? What am I doing wrong?
Thanks.
Reputation: 887721
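This could be a case where some of the columns contain missing values and the entire row is deleted because of the default option na.action = na.omit in aggregate. With cbind(...) on the left-hand side, a row is dropped if any of the four columns is NA, and an EVTYPE level whose rows are all dropped disappears from the result. That would also explain why aggregating PROPTOTAL alone keeps all 950 levels. I would try na.action = NULL, together with na.rm = TRUE so that sum ignores the retained NAs: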
aggregate(cbind(PROPTOTAL,CROPTOTAL,FATALITIES,INJURIES)~EVTYPE,
FUN=sum, na.rm=TRUE, data=storm, na.action=NULL)
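The row-dropping behaviour can be reproduced on a small made-up data frame. This is only a minimal sketch: `toy` and its columns `g`, `a`, `b` are hypothetical stand-ins, not the storm data, and it assumes the cause really is NAs in one of the summed columns.

```r
# Toy data (hypothetical): group "y" has NA in column b on its only row
toy <- data.frame(
  g = factor(c("x", "x", "y", "z")),
  a = c(1, 2, 3, 4),
  b = c(10, NA, NA, 40)
)

# Count missing values per column to see which one triggers the row drops
na_counts <- colSums(is.na(toy[, c("a", "b")]))

# Default na.action = na.omit: rows with NA in a or b are removed before
# summing, so group "y" disappears and part of column a is lost as well
dropped <- aggregate(cbind(a, b) ~ g, data = toy, FUN = sum)

# na.action = NULL keeps every row; na.rm = TRUE is passed on to sum()
kept <- aggregate(cbind(a, b) ~ g, data = toy, FUN = sum,
                  na.rm = TRUE, na.action = NULL)

nrow(dropped)              # fewer rows than there are groups
nrow(kept)                 # one row per group
sum(kept$a) == sum(toy$a)  # the total of column a is preserved
```

On the real data, `colSums(is.na(storm[, c("PROPTOTAL", "CROPTOTAL", "FATALITIES", "INJURIES")]))` should show which columns hold the NAs. Since `sum(storm$PROPTOTAL)` returned a number rather than NA, PROPTOTAL itself cannot be the one.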
Or we can use summarise_each from dplyr after grouping by 'EVTYPE':
library(dplyr)
storm %>%
group_by(EVTYPE) %>%
summarise_each(funs(sum=sum(., na.rm=TRUE)),
PROPTOTAL,CROPTOTAL,FATALITIES,INJURIES)
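In current dplyr releases, `summarise_each()` and `funs()` are deprecated in favour of `across()`. Below is a sketch of an equivalent call, assuming dplyr 1.0.0 or later. The data frame `toy` is made up for illustration; substitute `storm` and the four real column names.

```r
library(dplyr)

# Hypothetical miniature of the storm data, with NAs in one summed column
toy <- data.frame(
  EVTYPE = factor(c("HAIL", "HAIL", "FLOOD")),
  PROPTOTAL = c(100, 200, 50),
  CROPTOTAL = c(NA, 5, NA)
)

# Sum each selected column within each EVTYPE group, ignoring NAs
toy_sums <- toy %>%
  group_by(EVTYPE) %>%
  summarise(across(c(PROPTOTAL, CROPTOTAL), ~ sum(.x, na.rm = TRUE)))

toy_sums
```

Note that a group whose values are all NA sums to 0 when `na.rm = TRUE`, so every EVTYPE level present in the data keeps its row.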