de_a
de_a

Reputation: 73

Convert data.table with one id and one variable column to presence matrix

I have a data.table with the following structure:

num_id  value
1000    A1
1001    A1
1000    A2
1000    A3
1001    A54
1002    A55
1001    A100

and would like to turn it into a dt of the form

num_id A1        A2       A3       A54       A55       A100
1000   1         1        1        0         0         0
1001   1         0        0        1         0         1
1002   0         0        0        0         1         0

I thought this would be easy using dcast. The formula that came to mind was dcast(dt, numid~value) However that complained with Cross product of elements provided to CJ() would result in 4850158203 rows which exceeds .Machine$integer.max == 2147483647. This is more than the number of expected rows, because I have about 500,000 unique IDs. After running tests on a smaller datatable it appears the call to dcast is keeping the IDs exactly as they are, replacing the value column by a vector of columns where only 1 element is not null. That is not helping much since the essential aggregation/grouping step is missing.

I wrote the following code, which works but is slow and convoluted. Is there a way to do this in a single dcast call?

futurecolumns=unique(dt$value)
aggregated=dt[,list(list(value)), by=num_id]
out=t(sapply(aggregated$V1, function(x){futurecolumns %in% x}))
out=as.data.table(out*1)
out$num_id=aggregated$num_id
setnames(out, c(futurecolumns, "num_id"))

Upvotes: 1

Views: 412

Answers (2)

hello_friend
hello_friend

Reputation: 5788

Base R one obfuscated expression:

aggregate(. ~ num_id,
          data.frame(num_id = df$num_id,
                     +sapply(unique(df$value), `==`, df$value)), sum)

Upvotes: 1

Ronak Shah
Ronak Shah

Reputation: 388982

One way would to be to count number of rows for num_id and value and use pivot_wider :

library(dplyr)

dt %>%
  count(num_id, value) %>%
  tidyr::pivot_wider(names_from = value, values_from = n, 
                     values_fill = list(n = 0))

# A tibble: 3 x 7
#  num_id    A1    A2    A3   A54   A55  A100
#   <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1   1000     1     1     1     0     0     0
#2   1001     1     0     0     1     0     1
#3   1002     0     0     0     0     1     0

In base R you can use aggregate :

futurecolumns=unique(dt$value)
aggregate(value~num_id, dt, function(x) table(factor(x, levels = futurecolumns)))

Upvotes: 1

Related Questions