rjzii
rjzii

Reputation: 14563

How can you aggregate a single column based all the other columns?

I have a large number of CSV files that contain survey data. As a result I know that there is a good deal of duplicated data. Thus, I want to summarize the data by finding the frequency of unique responses. Typically I do this by inserting a frequency column with a value of one and then summing it. Like so:

data = aggregate(data$Freq, by = list(ONE = data$ONE, TWO = data$TWO, THREE  = data$THREE), FUN = sum)

In this case that approach is not feasible since the column names are not consistent and are variable in number. Sample data looks like what the following generates:

data = data.frame(
   ONE   = sample(1:2, 40, TRUE), 
   TWO   = sample(1:2, 40, TRUE),
   THREE = sample(1:5, 40, TRUE)); 
data$Freq = 1

How can I preserve the unique values and find the count of their appearance?

Upvotes: 0

Views: 50

Answers (1)

Spacedman
Spacedman

Reputation: 94317

If you want to do it without creating the Freq=1 variable, you can create that "on the fly":

> head(data)
  ONE TWO THREE
1   D   D     C
2   A   B     C
3   C   B     D
4   A   A     D
5   A   B     A

> aggregate(rep(1,nrow(data)) ~ ., data=data, sum)

   ONE TWO THREE rep(1, nrow(data))
1    B   A     A                  1
2    C   A     A                  3
3    D   A     A                  2
4    A   B     A                  1
5    D   B     A                  1

But it does give a rough name to the column. Wrap it in setNames:

> setNames(aggregate(rep(1,nrow(data)) ~ ., data=data, sum), c(names(data),"Freq"))
   ONE TWO THREE Freq
1    B   A     A    1
2    C   A     A    3
3    D   A     A    2
4    A   B     A    1
5    D   B     A    1

Upvotes: 1

Related Questions