Reputation: 14563
I have a large number of CSV files that contain survey data. As a result I know that there is a good deal of duplicated data. Thus, I want to summarize the data by finding the frequency of unique responses. Typically I do this by inserting a frequency column with a value of one and then summing it. Like so:
data = aggregate(data$Freq, by = list(ONE = data$ONE, TWO = data$TWO, THREE = data$THREE), FUN = sum)
In this case that approach is not feasible since the column names are not consistent and are variable in number. Sample data looks like what the following generates:
data = data.frame(
ONE = sample(1:2, 40, TRUE),
TWO = sample(1:2, 40, TRUE),
THREE = sample(1:5, 40, TRUE));
data$Freq = 1
How can I preserve the unique values and find the count of their appearance?
Upvotes: 0
Views: 50
Reputation: 94317
If you want to do it without creating the Freq=1 variable, you can create that "on the fly":
> head(data)
ONE TWO THREE
1 D D C
2 A B C
3 C B D
4 A A D
5 A B A
> aggregate(rep(1,nrow(data)) ~ ., data=data, sum)
ONE TWO THREE rep(1, nrow(data))
1 B A A 1
2 C A A 3
3 D A A 2
4 A B A 1
5 D B A 1
But it does give a rough name to the column. Wrap it in setNames
:
> setNames(aggregate(rep(1,nrow(data)) ~ ., data=data, sum), c(names(data),"Freq"))
ONE TWO THREE Freq
1 B A A 1
2 C A A 3
3 D A A 2
4 A B A 1
5 D B A 1
Upvotes: 1