Reputation: 33
I'm trying to format my data in R so that I can then use it properly for different general linear models.
The data is like this:
> str(data)
'data.frame': 1978 obs. of 7 variables:
$ country : Factor w/ 22 levels "AT","BE","CH",..: 8 8 8 8 8 8 8 8 8 8 ...
$ age : num 65 77 36 28 23 15 75 20 44 73 ...
$ gender : Factor w/ 2 levels "male","female": 2 1 1 1 2 2 1 2 2 1 ...
$ education_level : Factor w/ 6 levels "less_than_lower_sec",..: 5 1 3 5 5 2 1 3 3 5 ...
$ good_citizen_importance: Factor w/ 11 levels "00","01","02",..: 11 9 9 9 10 10 7 10 10 9 ...
$ trade : Factor w/ 7 levels "none_apply","member",..: 2 4 4 2 2 4 4 4 2 4 ...
$ relig : Factor w/ 7 levels "none_apply","member",..: 2 2 4 4 4 4 2 5 4 4 ...
Snippet from the data itself:
> head(data)
country age gender education_level good_citizen_importance trade relig
13711 FI 65 female tertiary 10 member member
13712 FI 77 male less_than_lower_sec 08 donated member
13713 FI 36 male upper_sec 08 donated donated
13714 FI 28 male tertiary 08 member donated
13715 FI 23 female tertiary 09 member donated
13716 FI 15 female lower_sec 09 donated donated
I have managed to produce frequency counts like the ones below, so I'm almost there. But I would like each level of the "good_citizen_importance" variable, together with its count, to become a column.
> counts <- count(data, c("good_citizen_importance", "trade", "relig", "gender"))
> head(counts)
good_citizen_importance trade relig gender freq
1 00 donated member male 1
2 00 donated donated male 1
3 01 member donated female 1
4 01 donated donated male 2
5 01 donated donated female 1
6 02 member member female 1
This is how I would like to have the data:
> head(counts)
trade relig gender "00" "01" "02" ...
1 donated member male 1 5 7 ...
2 donated donated male 12 2 3 ...
3 member donated female 11 3 1 ...
4 donated donated male 25 1 4 ...
5 donated donated female 12 1 1 ...
6 member member female 11 1 1 ...
So I would like the frequency of each level of one variable for every combination of the other variables. In other words, a frequency column for each of the 11 levels of the "good_citizen_importance" variable.
I'm sure this is not a very hard problem, but I have been fighting with it for several hours already, and I think I have exhausted my R and Google skills by now.
Upvotes: 1
Views: 52
Reputation: 4551
This can be accomplished by reshaping the data. In base R, the function reshape() can be used, but the syntax is awkward (I used to use it regularly, and I'd have to look up the syntax EVERY time). A better solution is spread() from the tidyverse suite of packages (specifically, it's in the tidyr package):
library(tidyr) # or library(tidyverse)
counts_wide <- counts %>%
  spread(good_citizen_importance, freq, fill = 0)
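In tidyr 1.0.0 and later, spread() is superseded by pivot_wider(), although spread() still works. Here is a minimal sketch of the same reshape with pivot_wider(). The small counts data frame is a made-up stand-in for the one in the question, not the asker's actual data:

```r
library(tidyr)

# Toy stand-in for the `counts` data frame from the question (values are made up)
counts <- data.frame(
  good_citizen_importance = c("00", "00", "01", "01", "02"),
  trade  = c("donated", "member", "donated", "member", "donated"),
  relig  = c("member", "donated", "member", "donated", "member"),
  gender = c("male", "female", "male", "female", "male"),
  freq   = c(1, 2, 5, 3, 7)
)

# One row per trade/relig/gender combination, one column per importance level.
# Combinations that never occur in `counts` are filled with 0 instead of NA.
counts_wide <- pivot_wider(
  counts,
  names_from  = good_citizen_importance,
  values_from = freq,
  values_fill = list(freq = 0)
)
```

With this toy input, counts_wide has the columns trade, relig, gender, "00", "01" and "02". Because the level columns do not have syntactic names, refer to them with backticks or double brackets, for example counts_wide[["00"]].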
If you aren't familiar with the pipe operator (%>%), it takes the output of the previous function and sets it as the first argument of the next function. It's used to make the code easier to read by removing lots of nested functions.
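As a small illustration of that equivalence, the sketch below computes the same value with nested calls and with the pipe. The vector x is an arbitrary example, not part of the question's data:

```r
library(tidyr)  # attaching tidyr also makes %>% available

x <- c(4, 9, 16)

# Nested calls are read from the inside out
nested <- sum(sqrt(x))

# Piped calls are read left to right: x becomes the first argument of sqrt(),
# and that result becomes the first argument of sum()
piped <- x %>% sqrt() %>% sum()

identical(nested, piped)
```

By the same rule, counts %>% spread(good_citizen_importance, freq, fill = 0) is just another way of writing spread(counts, good_citizen_importance, freq, fill = 0). R 4.1 and later also ships a native pipe, |>, which behaves the same way in simple cases like these.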
Upvotes: 1