Reputation: 21
I have a large data frame of 5 million rows, with three columns. I would like to transform it to a matrix which has as its rows USER_ID, ID as columns, and value as CNT. This could be done with melt
and cast
or
xtabs(CNT ~ USER_ID + ID, data = foo)
however the object created is to large and I get a following error 'dim' specifies too large an array
USER_ID ID CNT
1 1.813e+14 21 1
2 1.559e+14 28 1
6 1.592e+14 71 2
I'm trying to use data.table as is seams to handle large data much better then data.frame, but I can't figure out how to use data.table to create a contingency table I want.
Does any one have any idea how to get this working? I'm also thinking of creating and empty matrix with appropriate dimensions and fill it with appropriate indexes.
Upvotes: 1
Views: 330
Reputation: 269481
Try this using the built in data.frame CO2
:
> xtabs(uptake ~ Treatment + Type, CO2)
Type
Treatment Quebec Mississippi
nonchilled 742.0 545.0
chilled 666.8 332.1
or similarly using tapply
:
> with(CO2, tapply(uptake, list(Treatment, Type), sum))
Quebec Mississippi
nonchilled 742.0 545.0
chilled 666.8 332.1
and now compare to data.table:
> library(data.table)
>
> DT <- data.table(CO2)
> DT[, as.list(tapply(uptake, Type, sum)), by = Treatment]
Treatment Quebec Mississippi
1: nonchilled 742.0 545.0
2: chilled 666.8 332.1
Cautionary Note: If the same levels of Type
do not appear in every Treatment
group then this would not be sufficient. In that case it would be necessary to convert Type
to be a factor in the data table (as it already is in CO2
).
ADDED:
Its actually possible to get rid of tapply
and have a pure data table approach like this:
> DT[, setNames(as.list(.SD[,list(uptake = sum(uptake)), by = Type][, uptake]),
+ levels(Type)), by = Treatment]
Treatment Quebec Mississippi
1: nonchilled 742.0 545.0
2: chilled 666.8 332.1
The cautionary note above applies here too.
Upvotes: 3
Reputation: 49448
Here's another approach (also using CO2
dataset):
dt = data.table(CO2)
dt[, sum(uptake), by = list(Treatment, Type)][,
setNames(as.list(V1), paste(Type)), by = Treatment]
# Treatment Quebec Mississippi
#1: nonchilled 742.0 545.0
#2: chilled 666.8 332.1
Upvotes: 2