Dan Q

Reputation: 2257

How to create vector matrix of movie ratings using R project?

Suppose I am using this data set of movie ratings: http://www.grouplens.org/node/73

It contains ratings in a file formatted as userID::movieID::rating::timestamp

Given this, I want to construct a feature matrix in R, where each row corresponds to a user and each column holds the rating that the user gave to a particular movie (if any).

For example, if the data file contains

1::1::1::10
2::2::2::11
1::2::3::12
2::1::5::13
3::3::4::14

Then the output matrix would look like:

UserID, Movie1, Movie2, Movie3
1, 1, 3, NA
2, 5, 2, NA
3, NA, NA, 4

Is there a built-in way to achieve this in R? I wrote a simple Python script to do the same thing, but I bet there are more efficient ways to accomplish this.

Upvotes: 0

Views: 1668

Answers (3)

Iterator

Reputation: 20560

Quite simply, you can represent it as a sparse matrix, using sparseMatrix from the Matrix package.

Just create a three-column list of coordinates in the form (i, j, x), say in a data.frame named myDF. Then execute mySparseMat <- sparseMatrix(i = myDF$i, j = myDF$j, x = myDF$x, dims = c(numRows, numCols)). You need to decide the number of rows and columns; otherwise the maximum indices will be used to set the size of the matrix.

It's just that simple. Storing sparse data in a dense matrix is inappropriate, if not grotesque.
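As a minimal sketch of the call above, here it is applied to the five sample rows from the question. The column names i, j and x and the 3 x 3 dimensions are assumptions chosen for the toy data. Note that a sparse matrix stores unrated cells as implicit zeros rather than NA, so a rating of 0 and "no rating" are indistinguishable in this representation.

```r
library(Matrix)

# Toy data from the question as (user, movie, rating) triplets
myDF <- data.frame(
  i = c(1, 2, 1, 2, 3),   # user IDs (row indices)
  j = c(1, 2, 2, 1, 3),   # movie IDs (column indices)
  x = c(1, 2, 3, 5, 4)    # ratings
)

# dims is given explicitly; if omitted, the maximum indices are used
mySparseMat <- sparseMatrix(i = myDF$i, j = myDF$j, x = myDF$x,
                            dims = c(3, 3))

mySparseMat[1, 2]   # rating of user 1 for movie 2
mySparseMat[3, 1]   # unrated cell, stored as an implicit zero
```

If the real user or movie IDs are not consecutive integers, they would first need to be mapped to consecutive indices, for example via factor codes.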

Upvotes: 0

Martin Morgan

Reputation: 46876

From the web site pointed to in a previous question, it appears that you want to represent

> print(object.size(integer(10000 * 72000)), units="Mb")
2746.6 Mb

which should be 'easy' with the 8 GB you reference in another question. Also, the total length is less than the maximum vector length in R, so that should be ok too. But see the end of the response for an important caveat!
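The size estimate above can be cross-checked with plain arithmetic, without actually allocating the vector. This is only a sketch: it assumes 4 bytes per integer and ignores the small per-object header, and it compares against the classic 2^31 - 1 length limit.

```r
# Number of cells in a 10000 x 72000 matrix
n_cells <- 10000 * 72000

# 4 bytes per integer, converted to Mb (2^20 bytes); roughly 2746.6
size_mb <- n_cells * 4 / 2^20
size_mb

# TRUE if the length fits under the classic 2^31 - 1 vector limit
n_cells < .Machine$integer.max
```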

I created, outside R, a tab-delimited version of the data file. I then read in the information I was interested in

what <- list(User=integer(), Film=integer(), Rating=numeric(), NULL)
x <- scan(fl, what)

The NULL entry drops the unused timestamp data. The 'User' and 'Film' entries are not sequential, and numeric() on my platform takes up twice as much memory as integer(), so I converted User and Film to factor, and Rating to integer() by doubling (the original scores are 1 to 5 in increments of 1/2).

x <- list(User=factor(x$User), Film=factor(x$Film),
          Rating=as.integer(2 * x$Rating))

I then allocated the matrix

ratings <- matrix(NA_integer_ ,
                 nrow=length(levels(x$User)),
                 ncol=length(levels(x$Film)),
                 dimnames=list(levels(x$User), levels(x$Film)))

and used the fact that a two-column matrix can be used to index another matrix

ratings[cbind(x$User, x$Film)] <- x$Rating

This is the step where memory use is at its maximum. I'd then remove the unneeded variable

rm(x)

The gc() function tells me how much memory I've used...

> gc()
            used   (Mb) gc trigger   (Mb)  max used   (Mb)
Ncells    140609    7.6     407500   21.8    350000   18.7
Vcells 373177663 2847.2  450519582 3437.2 408329775 3115.4

... a little over 3 Gb, so that's good.
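The factor-plus-index-matrix technique used above can be tried at toy scale first. This sketch uses made-up, non-sequential user and film IDs (10, 20, 30 and 100, 200, 300 are assumptions for illustration) with the ratings from the question; the factor codes supply the row and column positions.

```r
# Hypothetical toy data with non-sequential IDs
x <- list(User   = factor(c(10, 20, 10, 20, 30)),
          Film   = factor(c(100, 200, 200, 100, 300)),
          Rating = c(1L, 2L, 3L, 5L, 4L))

# Dense matrix pre-filled with NA, one row per user, one column per film
ratings <- matrix(NA_integer_,
                  nrow = nlevels(x$User),
                  ncol = nlevels(x$Film),
                  dimnames = list(levels(x$User), levels(x$Film)))

# Each row of the two-column index matrix is a (row, column) pair
ratings[cbind(as.integer(x$User), as.integer(x$Film))] <- x$Rating

ratings["10", "200"]   # rating of user 10 for film 200
ratings["30", "100"]   # NA: user 30 did not rate film 100
```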

Having done that, you'll now run into serious problems. kmeans (from your response to questions on an earlier answer) will not work with missing values

> m = matrix(rnorm(100), 5)
> m[1,1]=NA
> kmeans(m, 2)
Error in do_one(nmeth) : NA/NaN/Inf in foreign function call (arg 1)

and as a very rough rule of thumb I'd expect ready-made R solutions to require 3-5 times as much memory as the starting data size. Have you worked through your analysis with a smaller data set?

Upvotes: 0

Vincent Zoonekynd

Reputation: 32381

You can use the dcast function, in the reshape2 package, but the resulting data.frame may be huge (and sparse).

d <- read.delim(
  "u1.base",
  header = FALSE,   # the file has no header row; read.delim assumes one by default
  col.names = c("user", "film", "rating", "timestamp")
)
library(reshape2)
d <- dcast( d, user ~ film, value.var = "rating" )

If your fields are separated by double colons, you cannot use the sep argument of read.delim, which has to be only one character. If you already do some preprocessing outside R, it is easier to do it there (e.g., in Perl, it would just be s/::/\t/g), but you can also do it in R: read the file as a single column, split the strings, and concatenate the result.

d <- read.delim("a", header = FALSE, stringsAsFactors = FALSE)  # one column, no header row
d <- as.character( d[,1] )   # vector of strings
d <- strsplit( d, "::" )     # List of vectors of strings of characters
d <- lapply( d, as.numeric ) # List of vectors of numbers
d <- do.call( rbind, d )     # Matrix
d <- as.data.frame( d )
colnames( d ) <- c( "user", "movie", "rating", "timestamp" )
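Putting the two steps together on the five sample rows from the question gives a quick sanity check. This is a sketch only: the rows are typed in as a character vector instead of being read from a file, and the names lines, parts and wide are made up for the example.

```r
library(reshape2)

# Sample rows from the question, in place of reading the real file
lines <- c("1::1::1::10", "2::2::2::11", "1::2::3::12",
           "2::1::5::13", "3::3::4::14")

# Split on the literal "::" and stack the pieces into a 5 x 4 matrix
parts <- do.call(rbind, strsplit(lines, "::", fixed = TRUE))
storage.mode(parts) <- "numeric"

d <- as.data.frame(parts)
colnames(d) <- c("user", "film", "rating", "timestamp")

# One row per user, one column per film, NA where there is no rating
wide <- dcast(d, user ~ film, value.var = "rating")
wide
```

The columns of wide are named after the film IDs ("1", "2", "3"), so wide[["3"]] holds every user's rating for film 3.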

Upvotes: 3
