Reputation: 15
I have a rating data set in the format (userId, itemId, rating):
1 100 4
1 101 5
1 102 3
1 10 3
1 103 5
4 353 2
4 354 4
4 355 5
7 420 5
7 421 4
7 422 4
I'm trying to use the ALS method to build a matrix factorization model and obtain the user latent features and product latent features with this code:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.recommendation.{ALS, Rating}

object AlsTest {
  def main(args: Array[String]): Unit = {
    System.setProperty("hadoop.home.dir", "C:\\spark-1.5.1-bin-hadoop2.6\\winutil")
    val conf = new SparkConf().setAppName("test").setMaster("local[4]")
    val sc = new SparkContext(conf)

    // Load and parse the data: one "userId itemId rating" triple per line
    val data = sc.textFile("ratings.txt")
    val ratings = data.map(_.split(" ") match {
      case Array(user, item, rate) => Rating(user.toInt, item.toInt, rate.toDouble)
    })

    // Build the recommendation model using ALS
    val rank = 10
    val numIterations = 30
    val model = ALS.train(ratings, rank, numIterations, 0.01) // 0.01 is the regularization parameter (lambda)

    // Print the product latent features (this is the output shown below)
    model.productFeatures.cache().collect.foreach(println)
  }
}
I have set the rank to 10, so the output of model.productFeatures() should be an RDD[(Int, Array[Double])]. But the output looks wrong. The second element of each record is printed as a string of characters (what are these characters?), and its length differs between records. These are supposed to be the latent feature values, so every record should contain the same number of them, exactly equal to the rank (10), and that is not what I see. The output looks like this:
(48791,7fea9bb7)
(48795,284b451d)
(48799,3d64767d)
(48803,2f812fc3)
(48807,49d3ea7)
(48811,768cf084)
(48815,6845b7b6)
(48819,4e9c724a)
(48823,23191538)
(48827,3200d90f)
(48831,77bd30fe)
(48839,5a1e0261)
(48843,31c56ccf)
(48855,5b90359)
(48863,1b9de9d0)
(48867,313afdc8)
(48871,2b834c34)
(48875,666d21d6)
(48891,12ca97a2)
(48907,74f8fc8e)
(48911,452becc9)
(48915,4a47062b)
(48919,c76ef46)
(48923,3f596eca)
(48927,258e904c)
(48939,570abc88)
(48947,6c3d75f0)
(48951,18667983)
(48955,493b9633)
(48959,4b579d60)
In matrix factorization we construct two lower-dimensional matrices whose product approximates the rating matrix:
rating matrix ≈ p * q(transpose),
p = user latent feature matrix,
q = product latent feature matrix.
Can anyone explain the output format of the ALS methods in Spark?
Upvotes: 0
Views: 1730
Reputation: 352
To see the latent factors for each product use this syntax:
model.productFeatures.collect().foreach{case (productID,latentFactors) => println("proID:"+ productID + " factors:"+ latentFactors.mkString(",") )}
The result for the given dataset is as follows:
proID:1 factors:-1.262960433959961,-0.5678719282150269,1.5220979452133179,2.2127938270568848,-2.096022129058838,3.2418994903564453,0.9077783823013306,1.1294238567352295,-0.0628235936164856,-0.6788621544837952
proID:2 factors:-0.6275356411933899,-2.0269076824188232,1.735855221748352,3.7356512546539307,0.8256714344024658,1.5638374090194702,1.6725327968597412,-1.9434666633605957,0.868758499622345,0.18945524096488953
proID:3 factors:-1.262960433959961,-0.5678719282150269,1.5220979452133179,2.2127938270568848,-2.096022129058838,3.2418994903564453,0.9077783823013306,1.1294238567352295,-0.0628235936164856,-0.6788621544837952
proID:4 factors:-0.6275356411933899,-2.0269076824188232,1.735855221748352,3.7356512546539307,0.8256714344024658,1.5638374090194702,1.6725327968597412,-1.9434666633605957,0.868758499622345,0.18945524096488953
As you can see, each product has exactly 10 factors, which matches the given parameter val rank = 10.
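As for the characters in the original output: they are most likely not latent values at all. Calling println on a tuple that contains an Array[Double] uses the JVM's default array toString, which prints a type tag and a hexadecimal hash code instead of the contents. The hash has a variable number of hex digits, which would explain why the records look like they have different lengths. Here is a minimal sketch without Spark that shows the difference. The object name, the formatFeatures helper, and the sample values are made up for illustration:

```scala
object ArrayPrinting {
  // Render one (id, factors) record with the array contents spelled out.
  // `formatFeatures` is a hypothetical helper, not part of the Spark API.
  def formatFeatures(record: (Int, Array[Double])): String = {
    val (id, factors) = record
    s"id=$id rank=${factors.length} factors=${factors.mkString("[", ", ", "]")}"
  }

  def main(args: Array[String]): Unit = {
    val record = (100, Array(0.5, -1.25, 2.0)) // made-up values, rank 3

    // Default toString of the array: a type tag plus a hex hash code,
    // e.g. something like (100,[D@1b6d3586); the hash varies per run.
    println(record)

    // mkString prints the actual contents of the array.
    println(formatFeatures(record)) // id=100 rank=3 factors=[0.5, -1.25, 2.0]
  }
}
```

The same idea applies to the collected RDD: map each record through mkString (as in the one-liner above) before printing it.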
To answer your second question: after training the model you have access to two variables, userFeatures: RDD[(Int, Array[Double])] and productFeatures: RDD[(Int, Array[Double])]. Each entry of the user-item matrix is computed as the dot product of the corresponding user vector and product vector. For example, the source code of the predict method shows how these variables are used to predict the rating of a specific user for one product:
def predict(user: Int, product: Int): Double = {
  val userVector = userFeatures.lookup(user).head
  val productVector = productFeatures.lookup(product).head
  blas.ddot(rank, userVector, 1, productVector, 1)
}
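To make the dot product concrete, here is a small sketch in plain Scala (no Spark, no BLAS) of what that last line computes. The object name, the dot helper, and the vectors are assumptions chosen for illustration, with rank 3 instead of 10 to keep it short:

```scala
object DotProductPrediction {
  // Dot product of a user vector and a product vector of the same rank.
  // `dot` is a hypothetical stand-in for the blas.ddot call in predict.
  def dot(userVector: Array[Double], productVector: Array[Double]): Double = {
    require(userVector.length == productVector.length, "vectors must have the same rank")
    userVector.zip(productVector).map { case (u, p) => u * p }.sum
  }

  def main(args: Array[String]): Unit = {
    val userVector = Array(1.0, 2.0, 0.5)    // one row of p (made-up values)
    val productVector = Array(2.0, 0.5, 4.0) // one row of q (made-up values)

    // 1.0*2.0 + 2.0*0.5 + 0.5*4.0 = 5.0
    println(dot(userVector, productVector)) // 5.0
  }
}
```

Doing this for every (user, product) pair gives the full approximation of the rating matrix, p * q(transpose).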
Upvotes: 1