user2378561
user2378561

Reputation: 21

converting a row of data.frame to column names using data.table

I have a large data frame of 5 million rows, with three columns. I would like to transform it to a matrix which has as its rows USER_ID, ID as columns, and value as CNT. This could be done with melt and cast or

xtabs(CNT ~ USER_ID + ID, data = foo)

however the object created is to large and I get a following error 'dim' specifies too large an array

USER_ID ID CNT
1      1.813e+14 21   1
2      1.559e+14 28   1
6      1.592e+14 71   2

I'm trying to use data.table as is seams to handle large data much better then data.frame, but I can't figure out how to use data.table to create a contingency table I want.
Does any one have any idea how to get this working? I'm also thinking of creating and empty matrix with appropriate dimensions and fill it with appropriate indexes.

Upvotes: 1

Views: 330

Answers (2)

G. Grothendieck
G. Grothendieck

Reputation: 269481

Try this using the built in data.frame CO2 :

> xtabs(uptake ~ Treatment + Type, CO2)
            Type
Treatment    Quebec Mississippi
  nonchilled  742.0       545.0
  chilled     666.8       332.1

or similarly using tapply:

> with(CO2, tapply(uptake, list(Treatment, Type), sum))
           Quebec Mississippi
nonchilled  742.0       545.0
chilled     666.8       332.1

and now compare to data.table:

> library(data.table)
>
> DT <- data.table(CO2)
> DT[, as.list(tapply(uptake, Type, sum)), by = Treatment]
    Treatment Quebec Mississippi
1: nonchilled  742.0       545.0
2:    chilled  666.8       332.1

Cautionary Note: If the same levels of Type do not appear in every Treatment group then this would not be sufficient. In that case it would be necessary to convert Type to be a factor in the data table (as it already is in CO2).

ADDED:

Its actually possible to get rid of tapply and have a pure data table approach like this:

> DT[, setNames(as.list(.SD[,list(uptake = sum(uptake)), by = Type][, uptake]), 
+   levels(Type)), by = Treatment]
    Treatment Quebec Mississippi
1: nonchilled  742.0       545.0
2:    chilled  666.8       332.1

The cautionary note above applies here too.

Upvotes: 3

eddi
eddi

Reputation: 49448

Here's another approach (also using CO2 dataset):

dt = data.table(CO2)
dt[, sum(uptake), by = list(Treatment, Type)][,
     setNames(as.list(V1), paste(Type)), by = Treatment]
#    Treatment Quebec Mississippi
#1: nonchilled  742.0       545.0
#2:    chilled  666.8       332.1

Upvotes: 2

Related Questions