Reputation: 701
Here is my test DT;
a<-data.table(cluster=sample(LETTERS[1:3], size = 10, replace = T), a=sample(x=1:2, size=10, replace = T), b=sample(x=1:2, size=10, replace = T), c=sample(x=1:2, size=10, replace = T), d=sample(x=1:3, size=10, replace=T))
a
cluster a b c d
1: B 1 2 1 2
2: C 1 1 1 1
3: B 2 1 1 3
4: A 2 2 1 1
5: C 2 2 1 2
6: A 2 2 1 3
7: A 2 2 1 1
8: A 2 1 1 2
9: C 2 1 1 1
10: C 2 2 1 1
I use ply
package's count
to generate a count table as follows;
> a[, lapply(.SD, function(x) count(x)), .SDcols=2:5]
a.x a.freq b.x b.freq c.x c.freq d.x d.freq
1: 1 2 1 4 1 10 1 5
2: 2 8 2 6 1 10 2 3
3: 1 2 1 4 1 10 3 2
It is pretty ugly but somewhat serves a purpose. The output that I really wish is as follows ;
a.x a.freq b.x b.freq c.x c.freq d.x d.freq
1: 1 2 1 4 1 10 1 5
2: 2 8 2 6 NA NA 2 3
3: NA NA NA NA NA NA 3 2
Also, I would like to group them with cluster vectors if possible but adding by=cluster
fails. Furthermore, I've tried using UniqueN and .N, which works fine with a single column but not with multiple columns. At this point, I'd really appreciate any pointers.
Upvotes: 1
Views: 683
Reputation: 42592
If I understand correctly, the OP wants to count the number of occurrences of each number in columns a
to d
separately for each cluster.
The sample dataset suggests that the data in columns a
to d
is of the same type. In this case my preferred approach is to reshape the data from wide to long format using melt()
, aggregate the data as required and reshape back to wide format using dcast()
.
Here, reshaping and aggregation can be performed in one step.
library(magrittr) # piping used to improve readabilty
melt(a, "cluster") %>%
dcast(cluster + value ~ variable)
Aggregate function missing, defaulting to 'length' cluster value a b c d 1: A 1 0 1 4 2 2: A 2 4 3 0 1 3: A 3 0 0 0 1 4: B 1 1 1 2 0 5: B 2 1 1 0 1 6: B 3 0 0 0 1 7: C 1 1 2 4 3 8: C 2 3 2 0 1
Now, the columns a
to d
contain the count of occurrences. Note that data.table
's special symbol .N
is equivalent to nrow()
(or length()
, here ).
Alternatively, we can build a contingency table of the counts using table()
which is less compact than the dcast()
result above:
melt(a, "cluster") %$%
table(value, variable, cluster)
, , cluster = A variable value a b c d 1 0 1 4 2 2 4 3 0 1 3 0 0 0 1 , , cluster = B variable value a b c d 1 1 1 2 0 2 1 1 0 1 3 0 0 0 1 , , cluster = C variable value a b c d 1 1 2 4 3 2 3 2 0 1 3 0 0 0 0
Note that the normal forward-pipe operator %>%
has been replaced by the exposition pipe-operator %$%
(see help("magrittr")
).
Another alternative is to create a contingency table with one row per cluster:
melt(a, "cluster") %>%
dcast(cluster ~ value + variable)
Aggregate function missing, defaulting to 'length' cluster 1_a 1_b 1_c 1_d 2_a 2_b 2_d 3_d 1: A 0 1 4 2 4 3 1 1 2: B 1 1 2 0 1 1 1 1 3: C 1 2 4 3 3 2 1 0
Finally, for the sake of completeness, we can try to reproduce OP's expected result which repeats the values:
melt(a, "cluster") %>%
dcast(cluster ~ variable, list(I, length)) %>%
setcolorder(c(1, rep(2:4, each = 2L) + c(0L, 4L))) %>%
setnames(names(.) %>% stringr::str_remove("^value_")) %>%
print()
cluster I_a length_a I_b length_b I_c length_c I_d length_d 1: A 2 4 1 4 1 4 2 4 2: B 2 2 1 2 1 2 3 2 3: C 2 4 2 4 1 4 1 4
library(data.table)
a <- fread(
"i cluster a b c d
1: B 1 2 1 2
2: C 1 1 1 1
3: B 2 1 1 3
4: A 2 2 1 1
5: C 2 2 1 2
6: A 2 2 1 3
7: A 2 2 1 1
8: A 2 1 1 2
9: C 2 1 1 1
10: C 2 2 1 1", drop = 1L)
Upvotes: 1
Reputation: 887891
If we need to use .N
, loop over the column names, group by that column, get the .N
and cbind
with cbind.fill
from rowr
library(data.table)
do.call(rowr::cbind.fill, c(lapply(names(a)[-1],
function(nm) a[, .N, by = nm][order(get(nm))]), fill = NA))
a <- structure(list(cluster = c("B", "C", "B", "A", "C", "A", "A",
"A", "C", "C"), a = c(1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L),
b = c(2L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 2L), c = c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), d = c(2L, 1L, 3L, 1L,
2L, 3L, 1L, 2L, 1L, 1L)), class = c("data.table", "data.frame"
), row.names = c(NA, -10L))
Upvotes: 1