Monika

Reputation: 143

Spark dataframe to a list

I have a Spark DataFrame with the below schema:

root
 |-- cluster_info: struct (nullable = true)
 |    |-- cluster_id: string (nullable = true)
 |    |-- influencers: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- screenName: string (nullable = true)
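
For reference, a small DataFrame matching this schema can be built as below (the sample values are just placeholders):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Case classes mirroring the schema above; the rows are placeholder data.
case class Influencer(screenName: String)
case class ClusterInfo(cluster_id: String, influencers: Seq[Influencer])

val df = Seq(
  Tuple1(ClusterInfo("c1", Seq(Influencer("alice"), Influencer("bob")))),
  Tuple1(ClusterInfo("c2", Seq(Influencer("bob"), Influencer("carol"))))
).toDF("cluster_info")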

I need to get a unique list of screenName values, and I am doing it with the code below. But collect is a very heavy operation; is there a better way to do this?

var namesDF = df.select(concat_ws(",", $"cluster_info.influencers.screenName").as("screenName"))
val influencerNameList: List[String] = namesDF.map(r => r(0).asInstanceOf[String]).collect().toList.mkString(",").split(",").toList.distinct

Please suggest. Thanks in advance.

Upvotes: 1

Views: 214

Answers (1)

koiralo

Reputation: 23119

You can select the nested field screenName as an array, explode it, and get the distinct values as below:

import org.apache.spark.sql.functions.explode

// screenName is selected as an array of strings, then exploded into one row per name.
val namesDF = df
  .select($"cluster_info.influencers.screenName".as("screenName"))
  .withColumn("screenName", explode($"screenName"))
  .distinct()
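
Equivalently, you could explode the array of influencer structs first and then pick the field from each struct; a rough sketch of the same idea:

// Sketch: explode the structs themselves, then select the nested field.
val namesFromStructs = df
  .select(explode($"cluster_info.influencers").as("influencer"))
  .select($"influencer.screenName".as("screenName"))
  .distinct()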

You already have the distinct screenName values in namesDF. To get them as a list you can use:

namesDF.rdd.map(_.getString(0)).collect().toList

But I wouldn't suggest collecting the result if you have a big dataset.
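
If the dataset really is big, one alternative (just a sketch, with a placeholder output path) is to keep the distinct names distributed and write them to storage instead of collecting them to the driver:

// Sketch: persist the distinct names instead of pulling them to the driver.
namesDF.write.mode("overwrite").csv("/path/to/screen_names")  // placeholder path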

Hope this helps!

Upvotes: 2
