akh22
akh22

Reputation: 701

R data.table : creating a count table of values in multiple columns by using .N

Here is my test DT;

a<-data.table(cluster=sample(LETTERS[1:3], size = 10, replace = T), a=sample(x=1:2, size=10, replace = T), b=sample(x=1:2, size=10, replace = T), c=sample(x=1:2, size=10, replace = T), d=sample(x=1:3, size=10, replace=T))

a
    cluster a b c d
 1:       B 1 2 1 2
 2:       C 1 1 1 1
 3:       B 2 1 1 3
 4:       A 2 2 1 1
 5:       C 2 2 1 2
 6:       A 2 2 1 3
 7:       A 2 2 1 1
 8:       A 2 1 1 2
 9:       C 2 1 1 1
10:       C 2 2 1 1

I use ply package's count to generate a count table as follows;

> a[, lapply(.SD, function(x) count(x)), .SDcols=2:5]
   a.x a.freq b.x b.freq c.x c.freq d.x d.freq
1:   1      2   1      4   1     10   1      5
2:   2      8   2      6   1     10   2      3
3:   1      2   1      4   1     10   3      2

It is pretty ugly but somewhat serves a purpose. The output that I really wish is as follows ;

    a.x a.freq b.x b.freq c.x c.freq d.x d.freq
    1:   1      2   1      4   1     10   1      5
    2:   2      8   2      6  NA     NA   2      3
    3:   NA     NA  NA    NA  NA     NA   3      2

Also, I would like to group them with cluster vectors if possible but adding by=cluster fails. Furthermore, I've tried using UniqueN and .N, which works fine with a single column but not with multiple columns. At this point, I'd really appreciate any pointers.

Upvotes: 1

Views: 683

Answers (2)

Uwe
Uwe

Reputation: 42592

If I understand correctly, the OP wants to count the number of occurrences of each number in columns a to d separately for each cluster.

The sample dataset suggests that the data in columns a to d is of the same type. In this case my preferred approach is to reshape the data from wide to long format using melt(), aggregate the data as required and reshape back to wide format using dcast(). Here, reshaping and aggregation can be performed in one step.

library(magrittr) # piping used to improve readabilty
melt(a, "cluster") %>% 
  dcast(cluster + value ~ variable)
Aggregate function missing, defaulting to 'length'
   cluster value a b c d
1:       A     1 0 1 4 2
2:       A     2 4 3 0 1
3:       A     3 0 0 0 1
4:       B     1 1 1 2 0
5:       B     2 1 1 0 1
6:       B     3 0 0 0 1
7:       C     1 1 2 4 3
8:       C     2 3 2 0 1

Now, the columns a to d contain the count of occurrences. Note that data.table's special symbol .N is equivalent to nrow() (or length(), here ).

Alternatively, we can build a contingency table of the counts using table() which is less compact than the dcast() result above:

melt(a, "cluster") %$% 
  table(value, variable, cluster)
, , cluster = A

     variable
value a b c d
    1 0 1 4 2
    2 4 3 0 1
    3 0 0 0 1

, , cluster = B

     variable
value a b c d
    1 1 1 2 0
    2 1 1 0 1
    3 0 0 0 1

, , cluster = C

     variable
value a b c d
    1 1 2 4 3
    2 3 2 0 1
    3 0 0 0 0

Note that the normal forward-pipe operator %>% has been replaced by the exposition pipe-operator %$% (see help("magrittr")).

Another alternative is to create a contingency table with one row per cluster:

melt(a, "cluster") %>% 
  dcast(cluster ~ value + variable)
Aggregate function missing, defaulting to 'length'
   cluster 1_a 1_b 1_c 1_d 2_a 2_b 2_d 3_d
1:       A   0   1   4   2   4   3   1   1
2:       B   1   1   2   0   1   1   1   1
3:       C   1   2   4   3   3   2   1   0

Finally, for the sake of completeness, we can try to reproduce OP's expected result which repeats the values:

melt(a, "cluster") %>% 
  dcast(cluster ~ variable, list(I, length)) %>% 
  setcolorder(c(1, rep(2:4, each = 2L) + c(0L, 4L))) %>% 
  setnames(names(.) %>% stringr::str_remove("^value_")) %>% 
  print()
   cluster I_a length_a I_b length_b I_c length_c I_d length_d
1:       A   2        4   1        4   1        4   2        4
2:       B   2        2   1        2   1        2   3        2
3:       C   2        4   2        4   1        4   1        4

Data

library(data.table)
a <- fread(
"i    cluster a b c d
 1:       B 1 2 1 2
 2:       C 1 1 1 1
 3:       B 2 1 1 3
 4:       A 2 2 1 1
 5:       C 2 2 1 2
 6:       A 2 2 1 3
 7:       A 2 2 1 1
 8:       A 2 1 1 2
 9:       C 2 1 1 1
10:       C 2 2 1 1", drop = 1L)

Upvotes: 1

akrun
akrun

Reputation: 887891

If we need to use .N, loop over the column names, group by that column, get the .N and cbind with cbind.fill from rowr

library(data.table)
do.call(rowr::cbind.fill, c(lapply(names(a)[-1], 
        function(nm) a[,  .N, by = nm][order(get(nm))]), fill = NA))

data

a <- structure(list(cluster = c("B", "C", "B", "A", "C", "A", "A", 
"A", "C", "C"), a = c(1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), 
    b = c(2L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 2L), c = c(1L, 
    1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), d = c(2L, 1L, 3L, 1L, 
    2L, 3L, 1L, 2L, 1L, 1L)), class = c("data.table", "data.frame"
), row.names = c(NA, -10L))

Upvotes: 1

Related Questions