Reputation: 3782
I have the following DataFrame df
:
+-------------------+--------+--------------------+
| id| name| type| url|
+-------------------+--------+--------------------+
| 1| NT Note| aaaa| null|
| 1| NT Note| aaaa|http://www.teleab...|
| 1| NT Note| aaaa|http://www.teleab...|
| 1| NT Note| aaaa| null|
| 1| NT Note| aaaa| null|
| 2| ABC| bbbb| null|
| 2| ABC| bbbb| null|
| 2| ABC| bbbb| null|
| 2| ABC| bbbb| null|
+-------------------+--------+--------------------+
I am assigning the most frequent url
and type
values to each node:
def windowSpec = Window.partitionBy("id", "url", "type")
val result = df.withColumn("count", count("url").over(windowSpec))
.orderBy($"count".desc)
.groupBy("id")
.agg(
first("url").as("URL"),
first("type").as("Typel")
)
But in fact I need to prioritize the most frequent non-null url
in order to get the following result:
+-------------------+--------+--------------------+
| id| name| type| url|
+-------------------+--------+--------------------+
| 1| NT Note| aaaa|http://www.teleab...|
| 2| ABC| bbbb| null|
+-------------------+--------+--------------------+
Now I get the below-shown output, because null
is more frequent for the record id 1
:
+-------------------+--------+--------------------+
| id| name| type| url|
+-------------------+--------+--------------------+
| 1| NT Note| aaaa| null|
| 2| ABC| bbbb| null|
+-------------------+--------+--------------------+
Upvotes: 1
Views: 115
Reputation: 1892
You can do this using udf
like below
import org.apache.spark.sql.functions._
import scala.collection.mutable.WrappedArray
//function to return most frequent url
def mfnURL(arr: WrappedArray[String]): String = {
val filterArr = arr.filterNot(_ == null)
if (filterArr.length == 0)
return null
else {
filterArr.groupBy(identity).maxBy(_._2.size)._1
}
}
//registering udf mfnURL
val mfnURLUDF = udf(mfnURL _)
//applying groupby , agg and udf
df.groupBy("id", "name", "type").agg(mfnURLUDF(collect_list("url")).alias("url")).show
//Sample output
+---+-------+----+--------------------+
| id| name|type| url|
+---+-------+----+--------------------+
| 2| ABC|bbbb| null|
| 1|NT Note|aaaa|http://www.teleab...|
+---+-------+----+--------------------+
Upvotes: 1