How to get the most frequent non-empty value in a column?

Question

I have the following DataFrame df:

+-----+-------------+--------+--------------------+
|   id|         name|    type|                 url|
+-----+-------------+--------+--------------------+
|    1|      NT Note|    aaaa|                null|
|    1|      NT Note|    aaaa|http://www.teleab...|
|    1|      NT Note|    aaaa|http://www.teleab...|
|    1|      NT Note|    aaaa|                null|
|    1|      NT Note|    aaaa|                null|
|    2|          ABC|    bbbb|                null|
|    2|          ABC|    bbbb|                null|
|    2|          ABC|    bbbb|                null|
|    2|          ABC|    bbbb|                null|
+-----+-------------+--------+--------------------+

I am assigning the most frequent url and type values to each id:

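import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{count, first}

// Count rows per (id, url, type) combination
val windowSpec = Window.partitionBy("id", "url", "type")
val result = df.withColumn("count", count("url").over(windowSpec))
  .orderBy($"count".desc)
  .groupBy("id")
  .agg(
    first("url").as("URL"),
    first("type").as("Typel")
  )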

But in fact I need to prioritize the most frequent non-null url in order to get the following result:

+-----+-------------+--------+--------------------+
|   id|         name|    type|                 url|
+-----+-------------+--------+--------------------+
|    1|      NT Note|    aaaa|http://www.teleab...|
|    2|          ABC|    bbbb|                null|
+-----+-------------+--------+--------------------+

Currently I get the output below, because null is more frequent than any actual url for id 1:

+-----+-------------+--------+--------------------+
|   id|         name|    type|                 url|
+-----+-------------+--------+--------------------+
|    1|      NT Note|    aaaa|                null|
|    2|          ABC|    bbbb|                null|
+-----+-------------+--------+--------------------+

How can I get the most frequent non-empty value in a column?

Answers (1)
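
One possible approach, offered as a minimal sketch rather than a definitive solution: count each (id, name, type, url) combination with count on the url column, which skips nulls, so rows with a null url get a count of 0. Then rank the combinations within each id and keep the top one. A null url is selected only when an id has no non-null url at all. The sample URL below is a placeholder (the URLs in the question are truncated), and the names counted, byId, cnt and rn are made up for illustration. The sketch also assumes the type should come from the same row as the chosen url, and that a local SparkSession is acceptable for trying it out.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{count, row_number}

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("most-frequent-non-null")
  .getOrCreate()
import spark.implicits._

// Sample rows shaped like the question's data; the URL is a placeholder.
val rows: Seq[(Int, String, String, Option[String])] = Seq(
  (1, "NT Note", "aaaa", None),
  (1, "NT Note", "aaaa", Some("http://example.com/a")),
  (1, "NT Note", "aaaa", Some("http://example.com/a")),
  (1, "NT Note", "aaaa", None),
  (1, "NT Note", "aaaa", None),
  (2, "ABC", "bbbb", None),
  (2, "ABC", "bbbb", None),
  (2, "ABC", "bbbb", None),
  (2, "ABC", "bbbb", None)
)
val df = rows.toDF("id", "name", "type", "url")

// count(url) ignores nulls, so the null-url group of each id gets cnt = 0.
val counted = df
  .groupBy("id", "name", "type", "url")
  .agg(count($"url").as("cnt"))

// Within each id: highest count first; ties broken by url, nulls last.
val byId = Window.partitionBy("id").orderBy($"cnt".desc, $"url".asc_nulls_last)

// Keep only the top-ranked combination per id.
val result = counted
  .withColumn("rn", row_number().over(byId))
  .where($"rn" === 1)
  .drop("cnt", "rn")

result.show()
```

Unlike orderBy followed by groupBy and first, which does not guarantee that the ordering survives the aggregation, the explicit row_number over a window makes the choice per id deterministic. If type should instead be picked independently of url, the same ranking could be applied to each column separately and the results joined on id.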
