Reputation: 73
I have a data.table with the following structure:
num_id  value
1000    A1
1001    A1
1000    A2
1000    A3
1001    A54
1002    A55
1001    A100
and would like to turn it into a dt of the form
num_id  A1  A2  A3  A54  A55  A100
1000     1   1   1    0    0     0
1001     1   0   0    1    0     1
1002     0   0   0    0    1     0
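For reference, the example input above can be rebuilt with:

library(data.table)
dt = data.table(num_id = c(1000, 1001, 1000, 1000, 1001, 1002, 1001),
                value  = c("A1", "A1", "A2", "A3", "A54", "A55", "A100"))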
I thought this would be easy using dcast. The formula that came to mind was dcast(dt, num_id ~ value). However, that call failed with:

Cross product of elements provided to CJ() would result in 4850158203 rows which exceeds .Machine$integer.max == 2147483647

That is far more than the number of rows I expect, since I have only about 500,000 unique IDs. After running tests on a smaller data.table, it appears that dcast keeps the IDs exactly as they are and replaces the value column with a set of columns in which only one element per row is non-NA. That does not help much, since the essential aggregation/grouping step is missing.
I wrote the following code, which works but is slow and convoluted. Is there a way to do this in a single dcast call?
# all distinct values; these become the output columns
futurecolumns = unique(dt$value)
# collapse each id's values into a single list column
aggregated = dt[, list(list(value)), by = num_id]
# for each id, flag which of the future columns it contains
out = t(sapply(aggregated$V1, function(x) futurecolumns %in% x))
out = as.data.table(out * 1)  # logical -> 0/1
out$num_id = aggregated$num_id
setnames(out, c(futurecolumns, "num_id"))
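For the record, a sketch of what such a single call might look like, using dcast's fun.aggregate argument to count rows per num_id/value pair (assuming each pair occurs at most once, so the counts are the desired 0/1 flags):

# count occurrences of each (num_id, value) pair while reshaping wide
dcast(dt, num_id ~ value, fun.aggregate = length, value.var = "value")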
Upvotes: 1
Views: 412
Reputation: 5788
One obfuscated base R expression:

# sapply builds a logical indicator matrix (one column per unique value);
# the unary + coerces it to 0/1, and aggregate sums the indicators per id
aggregate(. ~ num_id,
          data.frame(num_id = dt$num_id,
                     +sapply(unique(dt$value), `==`, dt$value)),
          sum)
Upvotes: 1
Reputation: 388982
One way would be to count the number of rows for each num_id/value combination and use pivot_wider:
library(dplyr)

dt %>%
  count(num_id, value) %>%
  tidyr::pivot_wider(names_from = value, values_from = n,
                     values_fill = list(n = 0))
# A tibble: 3 x 7
# num_id A1 A2 A3 A54 A55 A100
# <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 1000 1 1 1 0 0 0
#2 1001 1 0 0 1 0 1
#3 1002 0 0 0 0 1 0
In base R you can use aggregate:
futurecolumns = unique(dt$value)
aggregate(value ~ num_id, dt, function(x) table(factor(x, levels = futurecolumns)))
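Note that aggregate here returns the counts as a single matrix column named value; if ordinary flat columns are needed, one possible unpacking step is:

agg = aggregate(value ~ num_id, dt,
                function(x) table(factor(x, levels = futurecolumns)))
# expand the matrix column into regular columns (value.A1, value.A2, ...)
do.call(data.frame, agg)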
Upvotes: 1