Reputation: 4006
I have a table with multiple records. For example
PERSON, COLOR
John, orange
John, blue
Jack, green
Jack, purple
I know I can get min, max, first records using those operations. Is there a way to get the first value based on an ordered list. For the given example data above I'd like to get the first value in the order {red,orange,yellow,green,blue,indigo,violet}. So the result of the query would be
person, color
John, orange
Jack, green
Is there some syntax similar to this in standard sql or other specific dialects?
select
person,
first(color)
from (
select * from person_color
order by color{'red','orange','yellow','green','blue','indigo','violet'})
;
Upvotes: 2
Views: 74
Reputation: 8711
I like Gordon's answer. Here is another spark-sql solution using join and row_number.
Input:
val df = spark.sql("""
select 'John' person, 'orange' color union all
select 'John', 'blue' union all
select 'Jack', 'green' union all
select 'Jack', 'purple'
""")
df.createOrReplaceTempView("df")
df.show(false)
+------+------+
|person|color |
+------+------+
|John |orange|
|John |blue |
|Jack |green |
|Jack |purple|
+------+------+
Now use the distinct(person) and explode an array
val df2=spark.sql("""
with t1 (select distinct(person) person from df )
select person, posexplode(array('red','orange','yellow','green','blue','indigo','violet')) (pos,color) from t1
""")
df2.show(false)
df2.createOrReplaceTempView("df2")
+------+---+------+
|person|pos|color |
+------+---+------+
|Jack |0 |red |
|Jack |1 |orange|
|Jack |2 |yellow|
|Jack |3 |green |
|Jack |4 |blue |
|Jack |5 |indigo|
|Jack |6 |violet|
|John |0 |red |
|John |1 |orange|
|John |2 |yellow|
|John |3 |green |
|John |4 |blue |
|John |5 |indigo|
|John |6 |violet|
+------+---+------+
Do a join on person and color, assign row_number then filter on it.
spark.sql("""
with t1 ( select a.person,a.color, b.pos, row_number() over(partition by a.person order by pos) rw from df a join df2 b on a.person=b.person and a.color=b.color)
select person, color from t1 where rw=1
""").show
+------+------+
|person| color|
+------+------+
| Jack| green|
| John|orange|
+------+------+
Upvotes: 0
Reputation: 1269443
You can use row_number()
:
select t.*
from (select t.*,
row_number() over (partition by person
order by case color when 'red' then 1 when 'orange' then 2 when 'yellow' then 3 when 'green' then 4 when 'blue' when 'indigo' then 5 when 'violet' then 6 else 7 end
) as seqnum
from t
) t
where seqnum = 1;
Upvotes: 3
Reputation: 74619
Would the following be of any help?
val colors = Seq("red","orange","yellow","green","blue","indigo","violet").zipWithIndex.toDF("name", "rank")
// +------+----+
// | name|rank|
// +------+----+
// | red| 0|
// |orange| 1|
// |yellow| 2|
// | green| 3|
// | blue| 4|
// |indigo| 5|
// |violet| 6|
// +------+----+
val peopleWithColors = spark.read.option("header", true).csv("input.csv")
// +------+------+
// |PERSON| COLOR|
// +------+------+
// | John|orange|
// | John| blue|
// | Jack| green|
// | Jack|purple|
// +------+------+
val peopleColorsRanked = peopleWithColors.join(colors).where('color === 'name)
// +------+------+------+----+
// |PERSON| COLOR| name|rank|
// +------+------+------+----+
// | John|orange|orange| 1|
// | Jack| green| green| 3|
// | John| blue| blue| 4|
// +------+------+------+----+
And the gist of the solution is to use Windowed Aggregation (over person
column and sorted).
import org.apache.spark.sql.expressions.Window
val byPerson = Window.partitionBy('person).orderBy('rank)
val solution = peopleColorsRanked.withColumn("r", min('rank) over byPerson).where('rank === 'r)
// +------+------+------+----+---+
// |PERSON| COLOR| name|rank| r|
// +------+------+------+----+---+
// | Jack| green| green| 3| 3|
// | John|orange|orange| 1| 1|
// +------+------+------+----+---+
Upvotes: 0
Reputation: 4694
For an example of a dialect specific solution:
MySQL / MariaDB support: FIND_IN_SET
and FIELD
.
WITH data0 (person, color) AS (
SELECT 'John', 'orange' UNION
SELECT 'John', 'blue' UNION
SELECT 'Jack', 'green' UNION
SELECT 'Jack', 'purple'
)
, data1 AS (
SELECT t.*
, FIND_IN_SET(color, 'red,orange,yellow,green,blue,indigo,violet') AS seq
FROM data0 AS t
)
, data AS (
SELECT t.*
, ROW_NUMBER() OVER (PARTITION BY person ORDER BY seq < 1, seq) AS n
FROM data1 AS t
)
SELECT person, color, seq, n
FROM data
WHERE n = 1
;
and
WITH data0 (person, color) AS (
SELECT 'John', 'orange' UNION
SELECT 'John', 'blue' UNION
SELECT 'Jack', 'green' UNION
SELECT 'Jack', 'purple'
)
, data1 AS (
SELECT t.*
, FIELD(color, 'red','orange','yellow','green','blue','indigo','violet') AS seq
FROM data0 AS t
)
, data AS (
SELECT t.*
, ROW_NUMBER() OVER (PARTITION BY person ORDER BY seq < 1, seq) AS n
FROM data1 AS t
)
SELECT person, color, seq, n
FROM data
WHERE n = 1
;
Result:
+--------+--------+-----+---+
| person | color | seq | n |
+--------+--------+-----+---+
| Jack | green | 4 | 1 |
| John | orange | 2 | 1 |
+--------+--------+-----+---+
Upvotes: 0