Wallace

Reputation: 51

R - convert BIG table into matrix by column names

This is an extension to an existing question: Convert table into matrix by column names

I am using the final answer: https://stackoverflow.com/a/2133898/1287275

The original CSV file has about 1.5M rows and three columns: row index, column index, and value. All numbers are long integers. The underlying matrix is sparse, about 220K x 220K in size, with an average of about 7 values per row.

The original read.table works just fine.

  x <- read.table("/users/wallace/Hadoop_Local/reference/DiscoveryData6Mo.csv", header=TRUE);

My problem comes when I do the reshape command.

  reshape(x, idvar="page_id", timevar="reco", direction="wide")

The CPU hits 100% and stays there indefinitely. The machine (a Mac) has more memory than R is using, so I don't see why it should take so long to construct a sparse matrix.

I am using the default matrix package. I haven't installed anything extra. I just downloaded R a few days ago, so I should have the latest version.

Suggestions?

Thanks, Wallace

Upvotes: 5

Views: 3171

Answers (2)

Aaron - mostly inactive

Reputation: 37784

The simplest way to do this in base R is with matrix indexing, like this:

# make up data
num.pages <- 100
num.recos <- 100
N <- 300
set.seed(5)
df <- data.frame(page_id = sample.int(num.pages, N, replace=TRUE),
                 reco    = sample.int(num.recos, N, replace=TRUE),
                 value   = runif(N))

# now get the desired matrix
out <- matrix(nrow=num.pages, ncol=num.recos)
out[cbind(df$page_id, df$reco)] <- df$value

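However, in this case your resulting matrix would be 220k x 220k. Stored as a dense numeric matrix, that is about 4.84e10 cells, or roughly 387 GB at 8 bytes per cell, which is more memory than you have. You therefore need a package designed specifically for sparse matrices, as @flodel describes.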
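
The size of a dense matrix can be checked before allocating anything. Here is a back-of-the-envelope sketch using the dimensions from the question; it assumes 8 bytes per double, and the variable names are only illustrative:

```r
# Estimate the memory needed for a dense 220k x 220k matrix of doubles.
n.rows <- 220000
n.cols <- 220000
bytes.per.double <- 8

# These literals are doubles, so the product does not overflow integer range.
dense.bytes <- n.rows * n.cols * bytes.per.double
dense.gb <- dense.bytes / 1e9
dense.gb
# [1] 387.2

# Only about 1.5M cells would actually be filled; the rest would hold NA.
fill.fraction <- 1.5e6 / (n.rows * n.cols)
```

The fill fraction comes out well below 0.01%, which is why a sparse representation is the better fit here.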

Upvotes: 3

flodel

Reputation: 89097

I would use the sparseMatrix function from the Matrix package. The typical usage is sparseMatrix(i, j, x), where i, j, and x are three vectors of the same length: respectively, the row indices, column indices, and values of the non-zero elements in the matrix. Here is an example where I have tried to match variable names and dimensions to your specifications:

num.pages <- 220000
num.recos <- 220000
N         <- 1500000

df <- data.frame(page_id = sample.int(num.pages, N, replace=TRUE),
                 reco    = sample.int(num.recos, N, replace=TRUE),
                 value   = runif(N))
head(df)
#   page_id   reco     value
# 1   33688  48648 0.3141030
# 2   78750 188489 0.5591290
# 3  158870  13157 0.2249552
# 4   38492  56856 0.1664589
# 5   70338 138006 0.7575681
# 6  160827  68844 0.8375410

library("Matrix")
mat <- sparseMatrix(i = df$page_id,
                    j = df$reco,
                    x = df$value,
                    dims = c(num.pages, num.recos))
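
One caveat, offered as a sketch under assumptions rather than a definitive recipe: sparseMatrix treats i and j as 1-based row and column positions. If page_id and reco in the real file are arbitrary large IDs rather than positions, they may need to be mapped to compact indices first. The toy data and the names small, page.ids, reco.ids, and mat.small below are made up for illustration:

```r
library(Matrix)

# Toy data: IDs are arbitrary integers, not positions 1..n (hypothetical values).
small <- data.frame(page_id = c(1001L, 1001L, 5005L, 9009L),
                    reco    = c(5005L, 9009L, 1001L, 1001L),
                    value   = c(0.5, 1.5, 2.0, 3.0))

# Map each distinct ID to a compact index; the sorted IDs double as dimnames.
page.ids <- sort(unique(small$page_id))
reco.ids <- sort(unique(small$reco))

mat.small <- sparseMatrix(i = match(small$page_id, page.ids),
                          j = match(small$reco, reco.ids),
                          x = small$value,
                          dims = c(length(page.ids), length(reco.ids)),
                          dimnames = list(as.character(page.ids),
                                          as.character(reco.ids)))

dim(mat.small)             # 3 rows, 3 columns
mat.small["1001", "9009"]  # look up a cell by the original IDs
```

Keeping the IDs as dimnames lets you index the result by the original page and reco values. Also note that, by default, sparseMatrix adds together the values of any repeated (i, j) pairs, so duplicate rows in the input are summed rather than overwritten.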

Upvotes: 5
