Spark filter multiple group of rows to a single row

Question

I am trying to acheive the following,

Lets say I have a dataframe with the following columns

id  | name  | alias
-------------------
1   | abc   | short
1   | abc   | ailas-long-1
1   | abc   | another-long-alias
2   | xyz   | short_alias
2   | xyz   | same_length
3   | def   | alias_1

I want to groupby id and name and select the shorter alias,

The output I am expecting is

id  | name  | alias
-------------------
1   | abc   | short
2   | xyz   | short_alias
3   | def   | alias_1

I can achevie this using window and row_number, is there anyother efficient method to get the same result. In general, the thrid column filter condition can be anything in this case its the length of the field.

Any help would be much appreciated.

Thank you.

Ramesh Maharjan · Accepted Answer

All you need to do is use length inbuilt function and use that in window function as

from pyspark.sql import functions as f
from pyspark.sql import Window

windowSpec = Window.partitionBy('id', 'name').orderBy('length')

df.withColumn('length', f.length('alias'))\
    .withColumn('length', f.row_number().over(windowSpec))\
    .filter(f.col('length') == 1)\
    .drop('length')\
    .show(truncate=False)

which should give you

+---+----+-----------+
|id |name|alias      |
+---+----+-----------+
|3  |def |alias_1    |
|1  |abc |short      |
|2  |xyz |short_alias|
+---+----+-----------+

Spark filter multiple group of rows to a single row

Answers (2)

Related Questions