Reputation: 5
I have created below data frame, where order by doesn't work !
df_validation = spark.sql("""select
number, TYPE_NAME
from
(
select
\'number\' AS number,
\'TYPE_NAME\' AS TYPE_NAME,
cast(1 as int) as ORD
union all
SELECT
cast(sso as string) as sso,
'PROD' AS TYPE_NAME,
cast(2 as int) as ORD
FROM
table_info
WHERE id = {1}
ORDER BY ORD asc
)""".format(id))
The ORDER BY ORD asc, this part doesn't work as it's supposed to. It executes the program, however the order is not fixed. Sometimes it's ascending and sometimes descending.
How can I fix it and make ORDER BY work?
Upvotes: 0
Views: 1064
Reputation: 5141
You have to use order by to the data frame. Even thought you sort it in the sql query, when it is created as dataframe, the data will not be represented in sorted order. Please use below syntax in the data frame,
df.orderBy("col1")
Below is the code,
df_validation = spark.sql("""select
number, TYPE_NAME
from
(
select
\'number\' AS number,
\'TYPE_NAME\' AS TYPE_NAME,
cast(1 as int) as ORD
union all
SELECT
cast(sso as string) as sso,
'PROD' AS TYPE_NAME,
cast(2 as int) as ORD
FROM
table_info
WHERE id = {1}
ORDER BY ORD asc
)""".format(id)).orderBy("ORD")
Upvotes: 0
Reputation: 4990
You just need to put ORDER BY
outside of the parenthesis, so your code should be doing SELECT ... FROM (...) ORDER BY ORD
instead of SELECT ... FROM (... UNION ALL ... ORDER BY ORD)
. Your original code is applying ORDER BY
to a part of UNION ALL
instead of applying it to final result.
And when you apply ORDER BY
on a part of UNION ALL
, the ordering is not guaranteed to be preserved after the operation. It is true for the majority operations since Spark is designed to work with distributed DataFrames and full ordering is an unusual and costly operation in most cases (+ costly to maintain).
Upvotes: 0
Reputation: 1214
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder()
.master("local")
.appName("DataFrame-example")
.getOrCreate()
val df_validation = spark.sql("""select
number, TYPE_NAME
from
(
select
'number_4' AS number,
'TYPE_NAME_4' AS TYPE_NAME,
cast(4 as int) as ORD
union all
select
'number_3' AS number,
'TYPE_NAME_3' AS TYPE_NAME,
cast(3 as int) as ORD
union all
select
'number' AS number,
'TYPE_NAME' AS TYPE_NAME,
cast(1 as int) as ORD
union all
SELECT
cast('sso' as string) as number,
'PROD' AS TYPE_NAME,
cast(2 as int) as ORD
) ORDER BY ORD asc """)
df_validation.show(false)
// +--------+-----------+
// |number |TYPE_NAME |
// +--------+-----------+
// |number |TYPE_NAME |
// |sso |PROD |
// |number_3|TYPE_NAME_3|
// |number_4|TYPE_NAME_4|
// +--------+-----------+
// use DataFrame orderBy
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder()
.master("local")
.appName("DataFrame-example")
.getOrCreate()
val df_validation = spark.sql("""select
number, TYPE_NAME, ORD
from
(
select
'number_4' AS number,
'TYPE_NAME_4' AS TYPE_NAME,
cast(4 as int) as ORD
union all
select
'number_3' AS number,
'TYPE_NAME_3' AS TYPE_NAME,
cast(3 as int) as ORD
union all
select
'number' AS number,
'TYPE_NAME' AS TYPE_NAME,
cast(1 as int) as ORD
union all
SELECT
cast('sso' as string) as number,
'PROD' AS TYPE_NAME,
cast(2 as int) as ORD
) """)
df_validation.show(false)
// +--------+-----------+---+
// |number |TYPE_NAME |ORD|
// +--------+-----------+---+
// |number_4|TYPE_NAME_4|4 |
// |number_3|TYPE_NAME_3|3 |
// |number |TYPE_NAME |1 |
// |sso |PROD |2 |
// +--------+-----------+---+
val res1 = df_validation.orderBy('ORD.asc)
res1.show(false)
// +--------+-----------+---+
// |number |TYPE_NAME |ORD|
// +--------+-----------+---+
// |number |TYPE_NAME |1 |
// |sso |PROD |2 |
// |number_3|TYPE_NAME_3|3 |
// |number_4|TYPE_NAME_4|4 |
// +--------+-----------+---+
val res = res1.drop("ORD")
res.show(false)
// +--------+-----------+
// |number |TYPE_NAME |
// +--------+-----------+
// |number |TYPE_NAME |
// |sso |PROD |
// |number_3|TYPE_NAME_3|
// |number_4|TYPE_NAME_4|
// +--------+-----------+
Upvotes: 1