Learner Level1
Learner Level1

Reputation: 5

How to order by in SparkSQL?

I have created below data frame, where order by doesn't work !

df_validation = spark.sql("""select
number, TYPE_NAME
from
(
select
\'number\' AS number,                                                              
\'TYPE_NAME\' AS TYPE_NAME,                         
cast(1 as int) as ORD    
union all
SELECT
cast(sso as string) as sso, 
'PROD' AS TYPE_NAME,
cast(2 as int) as ORD
FROM  
    table_info
WHERE  id = {1}
ORDER BY ORD asc
)""".format(id))

The ORDER BY ORD asc, this part doesn't work as it's supposed to. It executes the program, however the order is not fixed. Sometimes it's ascending and sometimes descending.

How can I fix it and make ORDER BY work?

Upvotes: 0

Views: 1064

Answers (3)

Jim Macaulay
Jim Macaulay

Reputation: 5141

You have to use order by to the data frame. Even thought you sort it in the sql query, when it is created as dataframe, the data will not be represented in sorted order. Please use below syntax in the data frame,

    df.orderBy("col1") 

Below is the code,

df_validation = spark.sql("""select
number, TYPE_NAME
from
(
select
\'number\' AS number,                                                              
\'TYPE_NAME\' AS TYPE_NAME,                         
cast(1 as int) as ORD    
union all
SELECT
cast(sso as string) as sso, 
'PROD' AS TYPE_NAME,
cast(2 as int) as ORD
FROM  
table_info
WHERE  id = {1}
ORDER BY ORD asc
)""".format(id)).orderBy("ORD") 

Upvotes: 0

Alexander Pivovarov
Alexander Pivovarov

Reputation: 4990

You just need to put ORDER BY outside of the parenthesis, so your code should be doing SELECT ... FROM (...) ORDER BY ORD instead of SELECT ... FROM (... UNION ALL ... ORDER BY ORD). Your original code is applying ORDER BY to a part of UNION ALL instead of applying it to final result.

And when you apply ORDER BY on a part of UNION ALL, the ordering is not guaranteed to be preserved after the operation. It is true for the majority operations since Spark is designed to work with distributed DataFrames and full ordering is an unusual and costly operation in most cases (+ costly to maintain).

Upvotes: 0

mvasyliv
mvasyliv

Reputation: 1214

    import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local")
  .appName("DataFrame-example")
  .getOrCreate()

val df_validation = spark.sql("""select
    number, TYPE_NAME
    from
    (
    select
    'number_4' AS number,
    'TYPE_NAME_4' AS TYPE_NAME,
    cast(4 as int) as ORD
    union all
    select
    'number_3' AS number,
    'TYPE_NAME_3' AS TYPE_NAME,
    cast(3 as int) as ORD
    union all
    select
    'number' AS number,
    'TYPE_NAME' AS TYPE_NAME,
    cast(1 as int) as ORD
    union all
    SELECT
    cast('sso' as string) as number,
    'PROD' AS TYPE_NAME,
    cast(2 as int) as ORD
    ) ORDER BY ORD asc """)

  df_validation.show(false)
//    +--------+-----------+
//    |number  |TYPE_NAME  |
//    +--------+-----------+
//    |number  |TYPE_NAME  |
//    |sso     |PROD       |
//    |number_3|TYPE_NAME_3|
//    |number_4|TYPE_NAME_4|
//    +--------+-----------+
// use DataFrame orderBy

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .master("local")
      .appName("DataFrame-example")
      .getOrCreate()

    val df_validation = spark.sql("""select
        number, TYPE_NAME, ORD
        from
        (
        select
        'number_4' AS number,
        'TYPE_NAME_4' AS TYPE_NAME,
        cast(4 as int) as ORD
        union all
        select
        'number_3' AS number,
        'TYPE_NAME_3' AS TYPE_NAME,
        cast(3 as int) as ORD
        union all
        select
        'number' AS number,
        'TYPE_NAME' AS TYPE_NAME,
        cast(1 as int) as ORD
        union all
        SELECT
        cast('sso' as string) as number,
        'PROD' AS TYPE_NAME,
        cast(2 as int) as ORD
        ) """)

    df_validation.show(false)
//    +--------+-----------+---+
//    |number  |TYPE_NAME  |ORD|
//    +--------+-----------+---+
//    |number_4|TYPE_NAME_4|4  |
//    |number_3|TYPE_NAME_3|3  |
//    |number  |TYPE_NAME  |1  |
//    |sso     |PROD       |2  |
//    +--------+-----------+---+

    val res1 = df_validation.orderBy('ORD.asc)
    res1.show(false)
//    +--------+-----------+---+
//    |number  |TYPE_NAME  |ORD|
//    +--------+-----------+---+
//    |number  |TYPE_NAME  |1  |
//    |sso     |PROD       |2  |
//    |number_3|TYPE_NAME_3|3  |
//    |number_4|TYPE_NAME_4|4  |
//    +--------+-----------+---+

    val res = res1.drop("ORD")
    res.show(false)
//    +--------+-----------+
//    |number  |TYPE_NAME  |
//    +--------+-----------+
//    |number  |TYPE_NAME  |
//    |sso     |PROD       |
//    |number_3|TYPE_NAME_3|
//    |number_4|TYPE_NAME_4|
//    +--------+-----------+

Upvotes: 1

Related Questions