Reputation: 3883
I love how easy dplyr and tidyr have made it to create a single summary table with multiple predictor and outcome variables. One thing that stumped me was the final step: preserving (or defining) the order of the predictor variables, and of their factor levels, in the output table.
I've come up with a solution of sorts (below), which uses mutate to manually build a factor variable that combines the predictor and the predictor value (e.g. "gender_Female"), with levels in the desired output order. But my solution is a bit long-winded if there are many variables, and I wonder if there is a better way?
library(dplyr)
library(tidyr)
levels_eth <- c("Maori", "Pacific", "Asian", "Other", "European", "Unknown")
levels_gnd <- c("Female", "Male", "Unknown")
set.seed(1234)
dat <- data.frame(
  gender = factor(sample(levels_gnd, 100, replace = TRUE), levels = levels_gnd),
  ethnicity = factor(sample(levels_eth, 100, replace = TRUE), levels = levels_eth),
  outcome1 = sample(c(TRUE, FALSE), 100, replace = TRUE),
  outcome2 = sample(c(TRUE, FALSE), 100, replace = TRUE)
)
dat %>%
  gather(key = outcome, value = outcome_value, contains("outcome")) %>%
  gather(key = predictor, value = pred_value, gender, ethnicity) %>%
  # Statement below creates variable for ordering output
  mutate(
    pred_ord = factor(interaction(predictor, addNA(pred_value), sep = "_"),
                      levels = c(paste("gender", levels(addNA(dat$gender)), sep = "_"),
                                 paste("ethnicity", levels(addNA(dat$ethnicity)), sep = "_")))
  ) %>%
  group_by(pred_ord, outcome) %>%
  summarise(n = sum(outcome_value, na.rm = TRUE)) %>%
  ungroup() %>%
  spread(key = outcome, value = n) %>%
  separate(pred_ord, c("Predictor", "Pred_value"))
Source: local data frame [9 x 4]
  Predictor Pred_value outcome1 outcome2
      (chr)      (chr)    (int)    (int)
1    gender     Female       25       27
2    gender       Male       11       10
3    gender    Unknown       12       15
4 ethnicity      Maori       10        9
5 ethnicity    Pacific        7        7
6 ethnicity      Asian        6       12
7 ethnicity      Other       10        9
8 ethnicity   European        5        4
9 ethnicity    Unknown       10       11
Warning message:
attributes are not identical across measure variables; they will be dropped
The table above is what I want: neither the predictors nor the predictor values have been re-sorted alphabetically.
EDIT
As requested, this is what is produced if the default (alphabetical) ordering is used. This makes sense: when the factors are combined by gather, they are converted to a character variable and their attributes (including the level order) are dropped.
dat %>%
  gather(key = outcome, value = outcome_value, contains("outcome")) %>%
  gather(key = predictor, value = pred_value, gender, ethnicity) %>%
  group_by(predictor, pred_value, outcome) %>%
  summarise(n = sum(outcome_value, na.rm = TRUE)) %>%
  spread(key = outcome, value = n)
Source: local data frame [9 x 4]
  predictor pred_value outcome1 outcome2
      (chr)      (chr)    (int)    (int)
1 ethnicity      Asian        6       12
2 ethnicity   European        5        4
3 ethnicity      Maori       10        9
4 ethnicity      Other       10        9
5 ethnicity    Pacific        7        7
6 ethnicity    Unknown       10       11
7    gender     Female       25       27
8    gender       Male       11       10
9    gender    Unknown       12       15
Warning message:
attributes are not identical across measure variables; they will be dropped
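To illustrate why the default output is alphabetical, here is a minimal base R sketch. The toy vector eth is made up for illustration and is not part of dat; it only shows that a factor sorts by its level order, while the same values sort alphabetically once they have been coerced to character.

```r
# Levels copied from the question; eth is a hypothetical toy vector
levels_eth <- c("Maori", "Pacific", "Asian", "Other", "European", "Unknown")
eth <- factor(c("European", "Asian", "Maori", "Unknown"), levels = levels_eth)

# A factor sorts by the position of each value in its levels
by_level <- as.character(sort(eth))

# After coercion to character, the same values sort alphabetically
by_alpha <- sort(as.character(eth))

print(by_level)
print(by_alpha)
```

So any step that turns the factors into character vectors loses the custom order unless the levels are re-applied afterwards.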
Upvotes: 6
Views: 4351
Reputation: 41
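You can prefix your variables with values that force them into the right order, such as "X1_gender", "X2_ethnicity". The prefixes can be stripped with a mutate at the end. This is probably not a "tidy" solution, but it worked for my purposes on the problem that led me to this post. Note that it only fixes the order of the predictors; the values within each predictor are still sorted alphabetically, as the result below shows.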
library(dplyr)
library(tidyr)
levels_eth <- c("Maori", "Pacific", "Asian", "Other", "European", "Unknown")
levels_gnd <- c("Female", "Male", "Unknown")
set.seed(1234)
dat <- data.frame(
  X1_gender = factor(sample(levels_gnd, 100, replace = TRUE), levels = levels_gnd),
  X2_ethnicity = factor(sample(levels_eth, 100, replace = TRUE), levels = levels_eth),
  outcome1 = sample(c(TRUE, FALSE), 100, replace = TRUE),
  outcome2 = sample(c(TRUE, FALSE), 100, replace = TRUE)
)
dat %>%
  gather(key = outcome, value = outcome_value, contains("outcome")) %>%
  gather(key = predictor, value = pred_value, X1_gender, X2_ethnicity) %>%
  group_by(predictor, pred_value, outcome) %>%
  summarise(n = sum(outcome_value, na.rm = TRUE)) %>%
  spread(key = outcome, value = n) %>%
  mutate(predictor = gsub("^X[0-9]_", "", predictor))
Result:
`summarise()` regrouping output by 'predictor', 'pred_value' (override with
`.groups` argument)
# A tibble: 9 x 4
# Groups:   predictor, pred_value [9]
  predictor pred_value outcome1 outcome2
  <chr>     <chr>         <int>    <int>
1 gender    Female           16       21
2 gender    Male             12       15
3 gender    Unknown          18       16
4 ethnicity Asian             4        6
5 ethnicity European         13       13
6 ethnicity Maori             4        6
7 ethnicity Other             7       11
8 ethnicity Pacific          10        9
9 ethnicity Unknown           8        7
Warning message:
attributes are not identical across measure variables;
they will be dropped
Upvotes: 0
Reputation: 23200
You can do this in a much more concise and efficient way without special packages:
rbind(aggregate(dat[, colnames(dat) %in% c("outcome1", "outcome2")],
                by = list(dat$gender), sum),
      aggregate(dat[, colnames(dat) %in% c("outcome1", "outcome2")],
                by = list(dat$ethnicity), sum))
It aggregates the multiple predictors and outcome variables in a simple and direct way, and it avoids having to create the ordering variable used in your solution.
   Group.1 outcome1 outcome2
1   Female       25       27
2     Male       11       10
3  Unknown       12       15
4    Maori       10        9
5  Pacific        7        7
6    Asian        6       12
7    Other       10        9
8 European        5        4
9  Unknown       10       11
If you want to rename the columns above, just assign the result to an object (e.g. mytable <-) and rename them (i.e. colnames(mytable) <- c("Pred_value", "outcome1", "outcome2")). You could also scale it up with an apply if there are too many variables to type.
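As a rough sketch of that scaling-up idea, assuming lapply is an acceptable stand-in for "an apply": the helper summarise_one and the vectors predictors and outcomes are hypothetical names introduced here, and dat is rebuilt only so the block runs on its own. The helper also adds a Predictor column, which the rbind call above does not produce.

```r
# Rebuild the example data from the question so the block is self-contained
levels_eth <- c("Maori", "Pacific", "Asian", "Other", "European", "Unknown")
levels_gnd <- c("Female", "Male", "Unknown")
set.seed(1234)
dat <- data.frame(
  gender = factor(sample(levels_gnd, 100, replace = TRUE), levels = levels_gnd),
  ethnicity = factor(sample(levels_eth, 100, replace = TRUE), levels = levels_eth),
  outcome1 = sample(c(TRUE, FALSE), 100, replace = TRUE),
  outcome2 = sample(c(TRUE, FALSE), 100, replace = TRUE)
)

# Hypothetical configuration: list predictors in the desired output order
predictors <- c("gender", "ethnicity")
outcomes <- c("outcome1", "outcome2")

# Hypothetical helper: aggregate all outcomes by one predictor and label the rows
summarise_one <- function(p) {
  res <- aggregate(dat[outcomes], by = list(Pred_value = dat[[p]]), FUN = sum)
  data.frame(Predictor = p, res, stringsAsFactors = FALSE)
}

# One block per predictor, stacked in the order given in `predictors`
mytable <- do.call(rbind, lapply(predictors, summarise_one))
print(mytable)
```

Because aggregate orders its groups by factor level, each block keeps the level order defined in dat, and the blocks appear in the order listed in predictors.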
Upvotes: 5
Reputation: 43344
If you want your data to be factors arranged as such, you'll need to convert them back to factors, as gather coerces to character (which it warns you about). You can use gather's factor_key parameter to take care of predictor, but you'll need to assemble levels for pred_value as it now combines two factors from the original. Simplifying a bit:
library(tidyr)
library(dplyr)
dat %>%
  gather(key = predictor, value = pred_value, gender, ethnicity, factor_key = TRUE) %>%
  group_by(predictor, pred_value) %>%
  summarise_all(sum) %>%
  ungroup() %>%
  mutate(pred_value = factor(pred_value, levels = unique(c(levels_eth, levels_gnd),
                                                         fromLast = TRUE))) %>%
  arrange(predictor, pred_value)
## # A tibble: 9 × 4
##   predictor pred_value outcome1 outcome2
##      <fctr>     <fctr>    <int>    <int>
## 1    gender     Female       25       27
## 2    gender       Male       11       10
## 3    gender    Unknown       12       15
## 4 ethnicity      Maori       10        9
## 5 ethnicity    Pacific        7        7
## 6 ethnicity      Asian        6       12
## 7 ethnicity      Other       10        9
## 8 ethnicity   European        5        4
## 9 ethnicity    Unknown       10       11
Note that you'll need to use unique with fromLast = TRUE to arrange the duplicate "Unknown" values into a single occurrence in the right place; union will put it earlier.
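To make the difference concrete, here is a small base R comparison using the two level vectors from the question (lv_last and lv_union are just illustrative names):

```r
levels_eth <- c("Maori", "Pacific", "Asian", "Other", "European", "Unknown")
levels_gnd <- c("Female", "Male", "Unknown")

# Keep the LAST occurrence of each duplicate, so "Unknown" ends up at the end
lv_last <- unique(c(levels_eth, levels_gnd), fromLast = TRUE)

# union() keeps the FIRST occurrence, so "Unknown" lands before "Female"
lv_union <- union(levels_eth, levels_gnd)

print(lv_last)
print(lv_union)
```

With lv_union as the levels, "Unknown" would sort ahead of "Female" and "Male" within gender, which is not the order wanted here.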
Upvotes: 11