How to Calculate Summary Statistics (Standard Error, and Upper and Lower Confidence intervals) using the package data.table in R

Question

Problem

I have a data frame called FID (see below) and I am attempting to use the package data.table to summarize my data. I want to summarise my data by:-

Desired Summarised Data frame

Month
Total frequency of FID per month over 3 years
Mean frequency of FID per month over 3 years
Standard deviation of FID per month over 3 years
Standard error of FID per month over 3 years
Lower confidence levels per month over 3 years
Upper confidence levels per month over 3 years

I can perform some of these procedures separately (see below), but I would like to combine all of the information stated above in the desired data frame list (above) together into one table.

I have read extensively on Stack Overflow pages and other data.table tutorials but I cannot find any information with how to calculate the standard error, and the upper and lower confidence intervals using the package data.table. Does anyone know how to do this?

  ##Summary Statistics table of FID per month over 3 years

   library(data.table)

  ##Produce a data.table object
    FID.Table<-data.table(FID)

   ##R-code
   Mean.FID<-FID_Table[, .(FID.Freq=sum(FID),
                        mean = mean(FID),
                        sd=sd(FID),
                        median=median(FID)), 
                        by = .(Month)]

 ###Summary Statistics table 
       Month FID.Freq      mean        sd median
 1:   January      165 55.000000 10.535654     56
 2:  February      182 60.666667 29.737743     65
 3:     March      179 59.666667 33.291641     43
 4:     April      104 34.666667 16.862186     27
 5:       May      124 41.333333 49.571497     20
 6:      June       10  3.333333  5.773503      0
 7:      July       15  5.000000  4.358899      7
 8:    August      133 44.333333 21.007935     45
 9: September       97 32.333333 21.548395     34
10:   October       82 27.333333 13.051181     26
11:  November       75 25.000000 19.000000     25
12:  December      102 34.000000  4.582576     33

Data frame: FID

structure(list(Year = c(2015L, 2015L, 2015L, 2015L, 2015L, 2015L, 
2015L, 2015L, 2015L, 2015L, 2015L, 2015L, 2016L, 2016L, 2016L, 
2016L, 2016L, 2016L, 2016L, 2016L, 2016L, 2016L, 2016L, 2016L, 
2017L, 2017L, 2017L, 2017L, 2017L, 2017L, 2017L, 2017L, 2017L, 
2017L, 2017L, 2017L), Month = structure(c(5L, 4L, 8L, 1L, 9L, 
7L, 6L, 2L, 12L, 11L, 10L, 3L, 5L, 4L, 8L, 1L, 9L, 7L, 6L, 2L, 
12L, 11L, 10L, 3L, 5L, 4L, 8L, 1L, 9L, 7L, 6L, 2L, 12L, 11L, 
10L, 3L), .Label = c("April", "August", "December", "February", 
"January", "July", "June", "March", "May", "November", "October", 
"September"), class = "factor"), FID = c(65L, 88L, 43L, 54L, 
98L, 0L, 0L, 23L, 10L, 15L, 6L, 33L, 56L, 29L, 98L, 23L, 6L, 
10L, 7L, 65L, 53L, 41L, 25L, 30L, 44L, 65L, 38L, 27L, 20L, 0L, 
8L, 45L, 34L, 26L, 44L, 39L)), class = "data.frame", row.names = c(NA, 
-36L))

Andrew · Accepted Answer

Assuming you want the number of rows in each month to be the denominator for your standard error (i.e., .N), then you can use this to create 95% ci's (i.e., * 1.96). Alternatively, if you have missing data, you may want to use sum(!is.na(FID.Freq)) instead of .N. In short, just calculate standard error for each month then add the ci's as columns later:

library(data.table)

setDT(FID)

Mean.FID = FID[, .(FID.Freq=sum(FID),
                   mean = mean(FID),
                   sd=sd(FID),
                   se=sd(FID) / sqrt(.N),
                   median=median(FID)), by = Month]

Mean.FID[,  `:=`(lo_ci = mean - se * 1.96, up_ci = mean + se * 1.96)]

Mean.FID
        Month FID.Freq      mean        sd        se median       lo_ci     up_ci
 1:   January      165 55.000000 10.535654  6.082763     56  43.0777854 66.922215
 2:  February      182 60.666667 29.737743 17.169094     65  27.0152431 94.318090
 3:     March      179 59.666667 33.291641 19.220938     43  21.9936289 97.339704
 4:     April      104 34.666667 16.862186  9.735388     27  15.5853064 53.748027
 5:       May      124 41.333333 49.571497 28.620117     20 -14.7620965 97.428763
 6:      June       10  3.333333  5.773503  3.333333      0  -3.2000000  9.866667
 7:      July       15  5.000000  4.358899  2.516611      7   0.0674415  9.932558
 8:    August      133 44.333333 21.007935 12.128937     45  20.5606169 68.106050
 9: September       97 32.333333 21.548395 12.440972     34   7.9490287 56.717638
10:   October       82 27.333333 13.051181  7.535103     26  12.5645314 42.102135
11:  November       75 25.000000 19.000000 10.969655     25   3.4994760 46.500524
12:  December      102 34.000000  4.582576  2.645751     33  28.8143274 39.185673

How to Calculate Summary Statistics (Standard Error, and Upper and Lower Confidence intervals) using the package data.table in R

Answers (1)

Related Questions