Caragh

Reputation: 69

summarytools gives two different answers when I group with stby

I am using stby in summarytools to calculate weighted descriptive statistics by group. However, I get a different answer than when I first filter on the grouping variable and then apply the descr function from summarytools. See below: mydf is my unfiltered data frame, and score is a 0-10 variable whose mean I want.

##when I filter first and split my df
filtered_male <- mydf %>% filter(gender == 1)
with(filtered_male, stby(score, gender, descr, weights = weight))
Weighted Descriptive Statistics  
score by gender  
Data Frame: filtered_male  
Weights: weight  
N: 838  

                           1
--------------- ------------
           Mean         6.86
        Std.Dev         2.93
            Min         0.00
         Median         8.00
            Max        10.00
            MAD         2.97
             CV         0.43
        N.Valid   1509584.07
      Pct.Valid        99.70

##when I don't split my df
with(mydf, stby(score, gender, descr, weights = weight, simplify = TRUE))
Weighted Descriptive Statistics  
score by gender  
Data Frame: mydf 
Weights: weight  
N: 838  

                           1            2
--------------- ------------ ------------
           Mean         7.01         6.79
        Std.Dev         2.81         3.02
            Min         0.00         0.00
         Median         8.00         8.00
            Max        10.00        10.00
            MAD         2.97         2.97
             CV         0.40         0.45
        N.Valid   1715494.12   1379339.65
      Pct.Valid        56.05        45.07


Any ideas on why this is happening, or how I can fix it to get the correct weighted mean? (I've checked the answers manually, and the mean where I filter first is the correct one.)

Upvotes: 0

Views: 74

Answers (1)

E.Wiest

Reputation: 5905

While waiting for an official fix, you can produce a valid stby object as follows.

### Packages
library(dplyr)
library(purrr)
library(summarytools)

### Data
mtcars

### Output with summarytools
st <- with(mtcars, stby(qsec, cyl, descr, weights = wt, simplify = TRUE))

Initial output :

Weighted Descriptive Statistics  
qsec by cyl  
Data Frame: mtcars  
Weights: wt  
N: 11  

                      4       6       8
--------------- ------- ------- -------
           Mean   19.04   17.95   16.73
        Std.Dev    1.53    1.64    1.21
            Min   16.70   15.50   14.50
         Median   18.87   18.29   17.15
            Max   22.90   20.22   18.00
            MAD    1.48    1.90    0.93
             CV    0.08    0.09    0.07
        N.Valid   34.72   21.50   46.30
      Pct.Valid   33.72   20.88   44.97

To fix the output :

### Replace the values in the stby object with new ones
### (descr is recomputed on each group's own rows and weights)
mtcars %>%
  group_by(cyl) %>%
  group_map(~ descr(.x$qsec, weights = .x$wt)) %>%
  walk2(.y = seq_along(.), function(x, y) { st[[y]][, ] <<- x[, ] })

### Bonus: add the missing N for each group
attributes(st[[1]])$data_info$N.Obs <- paste(
  map_int(seq_along(st), ~ attributes(st[[.x]])$data_info$N.Obs),
  collapse = ","
)

Output :

Weighted Descriptive Statistics  
qsec by cyl  
Data Frame: mtcars  
Weights: wt  
N: 11,7,14  

                       4        6        8
--------------- -------- -------- --------
           Mean    19.38    18.12    16.89
        Std.Dev     1.72     1.59     1.13
            Min    16.70    15.50    14.50
         Median    19.24    18.46    17.34
            Max    22.90    20.22    18.00
            MAD     1.09     2.00     0.71
             CV     0.09     0.09     0.07
        N.Valid    25.14    21.82    55.99
      Pct.Valid   100.00   100.00   100.00
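
If you want to cross-check the corrected table without relying on summarytools, the weighted means and the weight totals can be recomputed per group in base R. This is only a sketch: the names groups, w_means and w_sums are mine, and it assumes there are no missing values (true for mtcars; real data would need them handled first).

```r
# Split mtcars into one data frame per cyl group (base R only)
groups <- split(mtcars, mtcars$cyl)

# Weighted mean of qsec within each group, using only that group's weights
w_means <- sapply(groups, function(d) weighted.mean(d$qsec, d$wt))

# Sum of the weights within each group (the figure shown as N.Valid here)
w_sums <- sapply(groups, function(d) sum(d$wt))

round(w_means, 2)
round(w_sums, 2)
```

For cyl 4 this gives a weighted mean of about 19.38 from a weight total of about 25.14, which is in line with the corrected table rather than the initial one.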

Upvotes: 0
