How to count the frequency of words with CountVectorizer in Spark ML?

Question

The code below produces a count vector for each row in the DataFrame:

import org.apache.spark.ml.feature.{CountVectorizer, CountVectorizerModel}

val df = spark.createDataFrame(Seq(
  (0, Array("a", "b", "c")),
  (1, Array("a", "b", "b", "c", "a"))
)).toDF("id", "words")

// fit a CountVectorizerModel from the corpus
val cvModel: CountVectorizerModel = new CountVectorizer()
  .setInputCol("words")
  .setOutputCol("features")
  .fit(df)

cvModel.transform(df).show(false)

The result is:

+---+---------------+-------------------------+
|id |words          |features                 |
+---+---------------+-------------------------+
|0  |[a, b, c]      |(3,[0,1,2],[1.0,1.0,1.0])|
|1  |[a, b, b, c, a]|(3,[0,1,2],[2.0,2.0,1.0])|
+---+---------------+-------------------------+

How can I get the total count of each word across all rows, like this:

+---+------+------+
|id |words |counts|
+---+------+------+
|0  |a     |  3   |
|1  |b     |  3   |
|2  |c     |  2   |
+---+------+------+

koiralo · Accepted Answer

You can simply explode the words column and group by word to get the count of each word:

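import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{count, explode, row_number}
import spark.implicits._  // enables the $"..." column syntax

cvModel.transform(df).withColumn("words", explode($"words"))  // one row per word occurrence
  .groupBy($"words")
  .agg(count($"words").as("counts"))
  .withColumn("id", row_number().over(Window.orderBy("words")) - 1)  // 0-based id, alphabetical order
  .show(false)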

Output:

+-----+------+---+
|words|counts|id |
+-----+------+---+
|a    |3     |0  |
|b    |3     |1  |
|c    |2     |2  |
+-----+------+---+
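
If you would rather reuse the CountVectorizer output than explode the words again, one option is to sum the per-row feature vectors element-wise and pair the totals with cvModel.vocabulary. This is only a minimal sketch and not part of the original answer: the names wordTotals and countsDf are illustrative, the local SparkSession setup is an assumption for running outside spark-shell, and it assumes the vocabulary is small enough to hold one dense array per row.

```scala
import org.apache.spark.ml.feature.CountVectorizer
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.SparkSession

// Assumption: a local session; in spark-shell getOrCreate() returns the existing one.
val spark = SparkSession.builder().master("local[*]").appName("word-totals").getOrCreate()
import spark.implicits._

val df = Seq(
  (0, Array("a", "b", "c")),
  (1, Array("a", "b", "b", "c", "a"))
).toDF("id", "words")

val cvModel = new CountVectorizer()
  .setInputCol("words")
  .setOutputCol("features")
  .fit(df)

// Densify each row's count vector and add the rows up position by position.
val wordTotals: Array[Double] = cvModel.transform(df)
  .select("features")
  .rdd
  .map(row => row.getAs[Vector]("features").toArray)
  .reduce((x, y) => x.zip(y).map { case (a, b) => a + b })

// Position i in the vectors corresponds to cvModel.vocabulary(i).
val countsDf = cvModel.vocabulary.zip(wordTotals).zipWithIndex
  .map { case ((word, total), idx) => (idx, word, total.toLong) }
  .toSeq
  .toDF("id", "words", "counts")

countsDf.show(false)
```

The ids here follow the vocabulary index, which CountVectorizer orders by term frequency across the corpus. The relative order of tied terms such as a and b is not guaranteed, so the ids may differ from the alphabetical ones above.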
