Creating a Random Feature Array in Spark DataFrames

Question

When creating an ALS model, we can extract a userFactors DataFrame and an itemFactors DataFrame. These DataFrames contain a column with an Array.

I would like to generate some random data and union it to the userFactors DataFrame.

Here is my code:

 val df1: DataFrame  = Seq((123, 456, 4.0), (123, 789, 5.0), (234, 456, 4.5), (234, 789, 1.0)).toDF("user", "item", "rating")
val model1 = (new ALS()
 .setImplicitPrefs(true)
 .fit(df1))

val iF = model1.itemFactors
val uF = model1.userFactors

I then create a random DataFrame using a VectorAssembler with this function:

def makeNew(df: DataFrame, rank: Int): DataFrame = {
    var df_dummy = df
    var inputCols: Array[String] = Array()
    for (i <- 0 to rank) {
      df_dummy = df_dummy.withColumn("feature".concat(i.toString), rand())
      inputCols = inputCols :+ "feature".concat(i.toString)
    }
    val assembler = new VectorAssembler()
      .setInputCols(inputCols)
      .setOutputCol("userFeatures")
    val output = assembler.transform(df_dummy)
    output.select("user", "userFeatures")
  }

I then create the DataFrame with new user IDs and add the random vectors and bias:

val usersDf: DataFrame = Seq((567), (678)).toDF("user")
var usersFactorsNew: DataFrame = makeNew(usersDf, 20)

The problem arises when I union the two DataFrames.

usersFactorsNew.union(uF) produces the error:

 org.apache.spark.sql.AnalysisException: Union can only be performed on tables with the compatible column types. struct<type:tinyint,size:int,indices:array<int>,values:array<double>> <> array<float> at the second column of the second table;;

If I print the schema, the uF DataFrame has a feature vector of type Array[Float] and the usersFactorsNew DataFrame has a feature vector of type Vector.

My question is how to change the type of the Vector to an Array in order to perform the union.

I tried writing this udf with little success:

val toArr: org.apache.spark.ml.linalg.Vector => Array[Double] = _.toArray
val toArrUdf = udf(toArr)

Perhaps the VectorAssembler is not the best option for this task. However, at the moment, it is the only option I have found. I would love to get some recommendations for something better.

Shaido · Accepted Answer

Instead of creating a dummy dataframe and using VectorAssembler to generate a random feature vector, you can simply use a UDF directly. The userFactors dataframe from the ALS model stores its features as Array[Float], so the output of the UDF should match that type.

import org.apache.spark.sql.functions.{lit, udf}
import scala.util.Random
val createRandomArray = udf((rank: Int) => {
  Array.fill(rank)(Random.nextFloat)
})

Note that this will give numbers in the interval [0.0, 1.0), the same as the rand() used in the question. If other numbers are required, modify as needed.
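If values in a different range are needed, one option is to rescale the output of nextFloat. The following is a minimal sketch, assuming a hypothetical helper named randomFactors with hypothetical low and high parameters (none of these are part of Spark or of the original answer). It is plain Scala, so it can be tried without a Spark session.

```scala
import scala.util.Random

// Hypothetical helper (not from the original answer): returns `rank` floats
// obtained by rescaling nextFloat() from [0, 1) to roughly [low, high).
def randomFactors(rank: Int, low: Float, high: Float, rng: Random = new Random()): Array[Float] =
  Array.fill(rank)(low + (high - low) * rng.nextFloat())

// Example: three values between -0.5 and 0.5, seeded so that repeated runs agree
val sample: Array[Float] = randomFactors(3, -0.5f, 0.5f, new Random(42L))
println(sample.mkString("[", ", ", "]"))
```

To use it in a dataframe, the helper could be wrapped in the same way as createRandomArray, for example udf((rank: Int) => randomFactors(rank, -0.5f, 0.5f)).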

Using a rank of 3 and the usersDf from the question:

val usersFactorsNew = usersDf.withColumn("userFeatures", createRandomArray(lit(3)))

will give a dataframe like the following (the feature values will of course differ between runs):

+----+----------------------------------------------------------+
|user|userFeatures                                              |
+----+----------------------------------------------------------+
|567 |[0.6866711267486822,0.7257031656127676,0.983562255688249] |
|678 |[0.7013908820314967,0.41029552817665327,0.554591149586789]|
+----+----------------------------------------------------------+

A union of this dataframe with the uF dataframe should now be possible.
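The union step itself can be sketched as follows. This is an illustration under stated assumptions, not the asker's actual pipeline: the uF dataframe here is a small hand-built stand-in with the (id, features) columns that ALS factor dataframes use, and the local SparkSession settings are placeholders.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{lit, udf}
import scala.util.Random

// Assumption: a local session for experimentation only
val spark = SparkSession.builder().master("local[*]").appName("union-sketch").getOrCreate()
import spark.implicits._

// Stand-in for model1.userFactors: columns (id: Int, features: Array[Float])
val uF = Seq(
  (123, Array(0.1f, 0.2f, 0.3f)),
  (234, Array(0.4f, 0.5f, 0.6f))
).toDF("id", "features")

val createRandomArray = udf((rank: Int) => Array.fill(rank)(Random.nextFloat()))
val usersDf = Seq(567, 678).toDF("user")
val usersFactorsNew = usersDf.withColumn("userFeatures", createRandomArray(lit(3)))

// union matches columns by position, not by name; the result keeps the
// column names of the left-hand dataframe (here: id, features)
val allFactors = uF.union(usersFactorsNew)
allFactors.printSchema()
```

Because union is positional, the differing column names (user/userFeatures versus id/features) do not matter, but the column order and types must line up.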

The reason the UDF did not work is most likely that it returns an Array[Double], while you need an Array[Float] for the union. It should be possible to fix with a map(_.toFloat).

val toArr: org.apache.spark.ml.linalg.Vector => Array[Float] = _.toArray.map(_.toFloat)
val toArrUdf = udf(toArr)
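Applying this UDF to a Vector column might look like the sketch below. The assembled dataframe is a hand-built stand-in for the VectorAssembler output from the question, and the local SparkSession settings are assumptions for experimentation.

```scala
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

// Assumption: a local session for experimentation only
val spark = SparkSession.builder().master("local[*]").appName("vector-to-array").getOrCreate()

val toArr: Vector => Array[Float] = _.toArray.map(_.toFloat)
val toArrUdf = udf(toArr)

// Stand-in for the output of makeNew: a Vector column named userFeatures
val assembled = spark.createDataFrame(Seq(
  (567, Vectors.dense(0.1, 0.2, 0.3)),
  (678, Vectors.dense(0.4, 0.5, 0.6))
)).toDF("user", "userFeatures")

// Overwrite the Vector column with its Array[Float] equivalent
val converted = assembled.withColumn("userFeatures", toArrUdf(col("userFeatures")))
converted.printSchema()
```

After this conversion the column type is array<float>, which is what the union with uF expects. On Spark 3.0 or later, the built-in vector_to_array function in org.apache.spark.ml.functions, called with dtype "float32", may serve the same purpose without a UDF.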
