jonas

Reputation: 13969

counts of grouped variables using dplyr

I would like to create a data frame with confidence intervals for proportions as the final result. I have introduced an indicator variable (tp in my example) that marks whether X exceeds a cutoff value, and I want to calculate the proportions for it. I would like to use the dplyr package to produce the final data frame. Below is a simplified example:

library(dplyr)

my_names <- c("A", "B")
dt <- data.frame(
  Z = sample(my_names, 100, replace = TRUE),
  X = sample(1:10, 100, replace = TRUE),  # draw 100 values instead of recycling 10
  Y = sample(c(0, 1), 100, replace = TRUE)
)

my.df <- dt %>%
  mutate(tp = (X > 8) * 1) %>%  # multiply by one to convert to numeric
  group_by(Z, tp) %>%
  summarise(n = n()) %>%
  mutate(prop.tp = n / sum(n)) %>%
  mutate(SE.tp = sqrt((prop.tp * (1 - prop.tp)) / n)) %>%
  mutate(Lower_limit = prop.tp - 1.96 * SE.tp) %>%
  mutate(Upper_limit = prop.tp + 1.96 * SE.tp)

output:

Source: local data frame [4 x 7]
Groups: Z

  Z tp  n   prop.tp      SE.tp Lower_limit Upper_limit
1 A  0 33 0.6346154 0.08382498   0.4703184   0.7989123
2 A  1 19 0.3653846 0.11047236   0.1488588   0.5819104
3 B  0 27 0.5625000 0.09547033   0.3753782   0.7496218
4 B  1 21 0.4375000 0.10825318   0.2253238   0.6496762

However, I would like to calculate the standard errors and the CIs using the total sample for each group in column Z, not the sample split by the categorical variable tp. So the total sample for A in my example should be n = 33 + 19 = 52. Any ideas?

Upvotes: 1

Answers (1)

Backlin

Reputation: 14842

I'm not quite sure which group you want to compare with which here, but at any rate you have two grouping variables, tp = X > 8 and Z. If you want to compare the rows with X > 8 and Z == "A" to all rows with X > 8, you can do it like this:

merge(
    dt %>%
        group_by(X > 8) %>%
        summarize(n.X = n()),
    dt %>%
        group_by(X > 8, Z) %>%
        summarise(n.XZ = n()),
    by = "X > 8"
) %>%
    mutate(prop.XZ = n.XZ/n.X) %>%
    mutate(SE = sqrt((prop.XZ*(1-prop.XZ))/n.X))%>%
    mutate(Lower_limit = prop.XZ-1.96 * SE) %>%
    mutate(Upper_limit = prop.XZ+1.96 * SE)

  X > 8 n.X Z n.XZ   prop.XZ         SE Lower_limit Upper_limit
1 FALSE  70 A   37 0.5285714 0.05966378   0.4116304   0.6455124
2 FALSE  70 B   33 0.4714286 0.05966378   0.3544876   0.5883696
3  TRUE  30 A   16 0.5333333 0.09108401   0.3548087   0.7118580
4  TRUE  30 B   14 0.4666667 0.09108401   0.2881420   0.6451913

If you want to turn the problem around and compare the rows with X > 8 and Z == "A" to all rows with Z == "A", you can do it like this:

merge(
    dt %>%
        group_by(Z) %>%
        summarize(n.Z = n()),
    dt %>%
        group_by(X > 8, Z) %>%
        summarise(n.XZ = n()),
    by = "Z"
) %>%
    mutate(prop.XZ = n.XZ/n.Z) %>%
    mutate(SE = sqrt((prop.XZ*(1-prop.XZ))/n.Z))%>%
    mutate(Lower_limit = prop.XZ-1.96 * SE) %>%
    mutate(Upper_limit = prop.XZ+1.96 * SE)

  Z n.Z X > 8 n.XZ   prop.XZ         SE Lower_limit Upper_limit
1 A  53 FALSE   37 0.6981132 0.06305900   0.5745176   0.8217088
2 A  53  TRUE   16 0.3018868 0.06305900   0.1782912   0.4254824
3 B  47 FALSE   33 0.7021277 0.06670743   0.5713811   0.8328742
4 B  47  TRUE   14 0.2978723 0.06670743   0.1671258   0.4286189

It is a bit messy having to merge two separate groupings, but I don't know if it is possible to ungroup and re-group in the same statement. I am surprised, though, at how difficult it seems to be to use groupings on two different levels (if you can call it that), and I hope someone else can come up with a better solution.
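
For what it's worth, a single pipeline does seem workable. The sketch below is not tested against the data above; it assumes a dplyr version in which count() accepts a name argument, and the fixed seed and the res name are only there for illustration. It counts each Z/tp combination first, then groups by Z alone so that the group total n.Z can be used as the denominator for the proportion and the standard error:

```r
library(dplyr)

set.seed(1)  # assumption: fixed seed so the sketch is reproducible
dt <- data.frame(
  Z = sample(c("A", "B"), 100, replace = TRUE),
  X = sample(1:10, 100, replace = TRUE)
)

res <- dt %>%
  mutate(tp = as.numeric(X > 8)) %>%
  count(Z, tp, name = "n.XZ") %>%   # one row per Z/tp combination, ungrouped
  group_by(Z) %>%                   # re-group on Z only
  mutate(
    n.Z = sum(n.XZ),                # total sample size within each Z
    prop.XZ = n.XZ / n.Z,
    SE = sqrt(prop.XZ * (1 - prop.XZ) / n.Z),
    Lower_limit = prop.XZ - 1.96 * SE,
    Upper_limit = prop.XZ + 1.96 * SE
  ) %>%
  ungroup()

print(res)
```

This should give the same columns as the second merge() version, with n.Z computed inside the pipeline instead of in a separate summary.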

Upvotes: 1
