Reputation: 51
This is an extension to an existing question: Convert table into matrix by column names
I am using the final answer: https://stackoverflow.com/a/2133898/1287275
The original CSV file has about 1.5M rows and three columns: row index, column index, and value. All numbers are long integers. The underlying matrix is sparse, about 220K x 220K in size, with an average of about 7 values per row.
The original read.table works just fine.
x <- read.table("/users/wallace/Hadoop_Local/reference/DiscoveryData6Mo.csv", header=TRUE);
My problem comes when I do the reshape command.
reshape(x, idvar="page_id", timevar="reco", direction="wide")
The CPU hits 100% and stays there indefinitely. The machine (a Mac) has more memory than R is using. I don't see why it should take so long to construct a sparse matrix.
I am using the default matrix package. I haven't installed anything extra. I just downloaded R a few days ago, so I should have the latest version.
Suggestions?
Thanks, Wallace
Upvotes: 5
Views: 3171
Reputation: 37784
The simplest way to do this in base R is with matrix indexing, like this:
# make up data
num.pages <- 100
num.recos <- 100
N <- 300
set.seed(5)
df <- data.frame(page_id = sample.int(num.pages, N, replace=TRUE),
                 reco    = sample.int(num.recos, N, replace=TRUE),
                 value   = runif(N))
# now get the desired matrix
out <- matrix(nrow=num.pages, ncol=num.recos)
out[cbind(df$page_id, df$reco)] <- df$value
However, in this case your resulting matrix would be 220k x 220k. Stored densely as doubles, that is about 4.84e10 entries, or roughly 387 GB, which is more memory than you have. You need to use a package designed for sparse matrices, as @flodel describes.
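As a rough check of the sizes involved, here is a back-of-the-envelope sketch. It assumes 8-byte doubles for the values and a compressed column layout for the sparse case (one 4-byte row index per non-zero plus one 4-byte pointer per column). The figures are estimates, not measurements.

```r
# Dense storage: every cell of a 220k x 220k matrix holds an 8-byte double
n <- 220000
dense_gb <- n^2 * 8 / 1e9          # about 387 GB

# Sparse storage (assumed compressed column layout): about 1.5M non-zeros,
# each with an 8-byte value and a 4-byte row index, plus n + 1 column pointers
nnz <- 1.5e6
sparse_mb <- (nnz * (8 + 4) + (n + 1) * 4) / 1e6   # about 19 MB

c(dense_gb = dense_gb, sparse_mb = sparse_mb)
```

The gap of roughly four orders of magnitude is why the dense approach above only suits small problems.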
Upvotes: 3
Reputation: 89097
I would use the sparseMatrix function from the Matrix package. The typical usage is sparseMatrix(i, j, x), where i, j, and x are three vectors of the same length: respectively, the row indices, column indices, and values of the non-zero elements in the matrix. Here is an example where I have tried to match variable names and dimensions to your specifications:
num.pages <- 220000
num.recos <- 230000
N <- 1500000
df <- data.frame(page_id = sample.int(num.pages, N, replace=TRUE),
                 reco    = sample.int(num.recos, N, replace=TRUE),
                 value   = runif(N))
head(df)
# page_id reco value
# 1 33688 48648 0.3141030
# 2 78750 188489 0.5591290
# 3 158870 13157 0.2249552
# 4 38492 56856 0.1664589
# 5 70338 138006 0.7575681
# 6 160827 68844 0.8375410
library("Matrix")
mat <- sparseMatrix(i = df$page_id,
                    j = df$reco,
                    x = df$value,
                    dims = c(num.pages, num.recos))
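If the IDs in your file are arbitrary long integers rather than consecutive values 1..n, using them directly as indices would make the matrix far larger than needed. A minimal sketch of one way to handle that is to map the IDs to consecutive positions with match() first. The small data frame below stands in for the one you read from the CSV, and the column name value is an assumption, since the question does not give the third column's name.

```r
library(Matrix)

# Stand-in for the data frame read from the CSV; the column name `value`
# is hypothetical, and the IDs are deliberately not 1..n.
x <- data.frame(page_id = c(1001, 1001, 5007, 9000),
                reco    = c(5007, 9000, 1001, 1001),
                value   = c(3, 1, 4, 2))

# Distinct IDs, sorted, define the row and column positions
page_ids <- sort(unique(x$page_id))
reco_ids <- sort(unique(x$reco))

# Labels without scientific notation, so large IDs stay readable
row_names <- format(page_ids, scientific = FALSE, trim = TRUE)
col_names <- format(reco_ids, scientific = FALSE, trim = TRUE)

mat <- sparseMatrix(i = match(x$page_id, page_ids),
                    j = match(x$reco, reco_ids),
                    x = x$value,
                    dims = c(length(page_ids), length(reco_ids)),
                    dimnames = list(row_names, col_names))

# Look up an entry by its original IDs
mat["1001", "9000"]
```

Note that sparseMatrix sums the values of any repeated (i, j) pair, so if the file can contain duplicate index pairs you may want to deduplicate it first.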
Upvotes: 5