Reputation: 2257
Suppose I am using this data set of movie ratings: http://www.grouplens.org/node/73
It contains ratings in a file formatted as userID::movieID::rating::timestamp
Given this, I want to construct a feature matrix in R, where each row corresponds to a user and each column holds the rating that the user gave to that movie (if any).
For example, if the data file contains
1::1::1::10
2::2::2::11
1::2::3::12
2::1::5::13
3::3::4::14
Then the output matrix would look like:
UserID, Movie1, Movie2, Movie3
1, 1, 3, NA
2, 5, 2, NA
3, NA, NA, 4
So is there some built-in way to achieve this in R? I wrote a simple Python script to do the same thing, but I bet there are more efficient ways to accomplish this.
Upvotes: 0
Views: 1668
Reputation: 20560
Quite simply, you can represent it as a sparse matrix, using sparseMatrix from the Matrix package.
Just create a three-column coordinate list, i.e. in the form (i, j, value), say in a data.frame named myDF. Then execute
mySparseMat <- sparseMatrix(i = myDF$i, j = myDF$j, x = myDF$x, dims = c(numRows, numCols))
You need to decide the number of rows and columns; otherwise the maximum indices are used to set the size of the matrix.
It's just that simple. Storing sparse data in a dense matrix is inappropriate, if not grotesque.
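As a minimal sketch of this approach on the five toy ratings from the question (the data frame contents and the 3 x 3 dimensions are illustrative assumptions; real user and movie IDs may first need remapping to consecutive indices):

```r
library(Matrix)

# Toy (user, movie, rating) triplets taken from the question's example
myDF <- data.frame(
  i = c(1, 2, 1, 2, 3),  # user IDs, used as row indices
  j = c(1, 2, 2, 1, 3),  # movie IDs, used as column indices
  x = c(1, 2, 3, 5, 4)   # ratings
)

# Build a 3 x 3 sparse matrix; unrated cells are implicit zeros, not NA
mySparseMat <- sparseMatrix(i = myDF$i, j = myDF$j, x = myDF$x, dims = c(3, 3))
print(mySparseMat)
```

Note that a sparse matrix treats missing entries as zeros, so this only works cleanly when 0 is not itself a valid rating.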
Upvotes: 0
Reputation: 46876
From the web site pointed to in a previous question, it appears that you want to represent a matrix of about 10000 x 72000 integers
> print(object.size(integer(10000 * 72000)), units="Mb")
2746.6 Mb
which should be 'easy' with the 8 GB you reference in another question. Also, the total length is less than the maximum vector length in R, so that should be OK too. But see the end of the response for an important caveat!
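The figure above can be checked by hand. This is a rough sketch that assumes 4 bytes per integer (as in R's integer type) and ignores the small per-object header:

```r
n_cells <- 10000 * 72000           # number of matrix cells
size_mb <- n_cells * 4 / 1024^2    # 4 bytes per integer; 1 Mb = 1024^2 bytes here
round(size_mb, 1)

# The cell count is also below the classic vector length limit of 2^31 - 1
n_cells < .Machine$integer.max
```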
I created, outside R, a tab-delimited version of the data file. I then read in the information I was interested in
what <- list(User=integer(), Film=integer(), Rating=numeric(), NULL)
x <- scan(fl, what)
The NULL drops the unused timestamp data. The 'User' and 'Film' entries are not sequential, and numeric() on my platform takes up twice as much memory as integer(), so I converted User and Film to factor, and Rating to integer() by doubling (original scores are 1 to 5 in increments of 1/2).
x <- list(User=factor(x$User), Film=factor(x$Film),
Rating=as.integer(2 * x$Rating))
I then allocated the matrix
ratings <- matrix(NA_integer_,
                  nrow=length(levels(x$User)),
                  ncol=length(levels(x$Film)),
                  dimnames=list(levels(x$User), levels(x$Film)))
and used the fact that a two-column matrix can be used to index another matrix
ratings[cbind(x$User, x$Film)] <- x$Rating
This is the step where memory use is at its maximum. I'd then remove the unneeded variable
rm(x)
The gc() function tells me how much memory I've used...
> gc()
            used   (Mb) gc trigger   (Mb)  max used   (Mb)
Ncells    140609    7.6     407500   21.8    350000   18.7
Vcells 373177663 2847.2  450519582 3437.2 408329775 3115.4
... a little over 3 Gb, so that's good.
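The same steps can be tried on a tiny made-up data set before committing several gigabytes of memory. In this sketch the IDs and ratings are invented for illustration, and the factor codes are converted with as.integer() explicitly to make the indexing visible:

```r
# Toy stand-in for the scanned data, with non-sequential IDs (invented values)
x <- list(User   = c(10, 20, 10, 20, 30),
          Film   = c(101, 205, 205, 101, 333),
          Rating = c(1, 2, 3, 5, 4))

# Factors map arbitrary IDs to consecutive integer codes; ratings are doubled
x <- list(User = factor(x$User), Film = factor(x$Film),
          Rating = as.integer(2 * x$Rating))

ratings <- matrix(NA_integer_,
                  nrow = nlevels(x$User), ncol = nlevels(x$Film),
                  dimnames = list(levels(x$User), levels(x$Film)))

# Each row of the two-column index matrix names one (row, column) cell to fill
ratings[cbind(as.integer(x$User), as.integer(x$Film))] <- x$Rating
print(ratings / 2)   # back on the original rating scale
```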
Having done that, you'll now run into serious problems. kmeans (from your response to questions on an earlier answer) will not work with missing values
> m = matrix(rnorm(100), 5)
> m[1,1]=NA
> kmeans(m, 2)
Error in do_one(nmeth) : NA/NaN/Inf in foreign function call (arg 1)
and as a very rough rule of thumb I'd expect ready-made R solutions to require 3-5 times as much memory as the starting data size. Have you worked through your analysis with a smaller data set?
Upvotes: 0
Reputation: 32381
You can use the dcast function, in the reshape2 package, but the resulting data.frame may be huge (and sparse).
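d <- read.delim(
  "u1.base",
  header = FALSE,  # the file has no header line; without this the first rating is lost
  col.names = c("user", "film", "rating", "timestamp")
)
library(reshape2)
# One row per user, one column per film, NA where there is no rating
d <- dcast( d, user ~ film, value.var = "rating" )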
If your fields are separated by double colons, you cannot use the sep argument of read.delim, which has to be only one character.
If you already do some preprocessing outside R, it is easier to do it there (e.g., in Perl, it would just be s/::/\t/g), but you can also do it in R: read the file as a single column, split the strings, and concatenate the result.
d <- read.delim("a", header = FALSE) # no header line, so keep the first row as data
d <- as.character( d[,1] ) # vector of strings
d <- strsplit( d, "::" ) # List of vectors of strings of characters
d <- lapply( d, as.numeric ) # List of vectors of numbers
d <- do.call( rbind, d ) # Matrix
d <- as.data.frame( d )
colnames( d ) <- c( "user", "movie", "rating", "timestamp" )
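Putting the two steps together on the toy data from the question gives a quick end-to-end check. This is only a sketch: the input lines are hard-coded here rather than read from a file (for a real file, readLines on its path would supply them), and the variable names are illustrative.

```r
library(reshape2)

# Toy file contents from the question, one record per element
lines <- c("1::1::1::10", "2::2::2::11", "1::2::3::12",
           "2::1::5::13", "3::3::4::14")

fields <- strsplit(lines, "::", fixed = TRUE)   # list of character vectors
d <- as.data.frame(do.call(rbind, lapply(fields, as.numeric)))
colnames(d) <- c("user", "movie", "rating", "timestamp")

# One row per user, one column per movie, NA where no rating exists
wide <- dcast(d, user ~ movie, value.var = "rating")
print(wide)
```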
Upvotes: 3