Reputation: 143
I have a Spark DataFrame with the below schema:
root
|-- cluster_info: struct (nullable = true)
| |-- cluster_id: string (nullable = true)
| |-- influencers: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- screenName: string (nullable = true)
And I need to get a unique list of screenName values, which I am doing with the code below. But collect is a very heavy operation; is there a better way to do it?
var namesDF = df.select(concat_ws(",", $"cluster_info.influencers.screenName").as("screenName"))
val influencerNameList: List[String] = namesDF.map(r => r(0).asInstanceOf[String]).collect().toList.mkString(",").split(",").toList.distinct
Please suggest. Thanks in advance.
Upvotes: 1
Views: 214
Reputation: 23119
You can select the nested field screenName as an array, explode it, and get the distinct values as below:
val namesDF = df.select($"cluster_info.influencers.screenName".as("screenName"))
  .withColumn("screenName", explode($"screenName"))
  .distinct()
You already got the distinct screenName values. To get the list you can use:
namesDF.rdd.map(_.getString(0)).collect().toList
But I don't suggest collecting the result if you have a big dataset.
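To see the flatten-and-distinct logic outside Spark, here is a minimal sketch using plain Scala collections, with hypothetical sample data standing in for the nested cluster_info.influencers rows (the case class and field names are assumptions for illustration):

```scala
// Hypothetical stand-ins for the nested Spark schema.
case class Influencer(screenName: String)
case class ClusterInfo(clusterId: String, influencers: Seq[Influencer])

val rows = Seq(
  ClusterInfo("c1", Seq(Influencer("alice"), Influencer("bob"))),
  ClusterInfo("c2", Seq(Influencer("bob"), Influencer("carol")))
)

// Equivalent of select + explode + distinct on the DataFrame:
// flatten the nested arrays of screenName, then deduplicate.
val names: List[String] =
  rows.flatMap(_.influencers.map(_.screenName)).distinct.toList

println(names) // List(alice, bob, carol)
```

In Spark the same shape applies: explode flattens the array column into one row per element, and distinct deduplicates, all executed on the cluster rather than on the driver.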
Hope this helps!
Upvotes: 2