John
John

Reputation: 4006

How to find the first value in a prescribed list?

I have a table with multiple records. For example

PERSON, COLOR
John, orange
John, blue
Jack, green
Jack, purple

I know I can get min, max, first records using those operations. Is there a way to get the first value based on an ordered list. For the given example data above I'd like to get the first value in the order {red,orange,yellow,green,blue,indigo,violet}. So the result of the query would be

person, color
John, orange
Jack, green

Is there some syntax similar to this in standard sql or other specific dialects?

select
  person,
  first(color)
from (
  select * from person_color 
  order by color{'red','orange','yellow','green','blue','indigo','violet'})
;

Upvotes: 2

Views: 74

Answers (4)

stack0114106
stack0114106

Reputation: 8711

I like Gordon's answer. Here is another spark-sql solution using join and row_number.

Input:

val df = spark.sql(""" 
select 'John' person, 'orange' color union all 
select 'John', 'blue' union all
select 'Jack', 'green' union all 
select 'Jack', 'purple'
""")
df.createOrReplaceTempView("df")
df.show(false)

+------+------+
|person|color |
+------+------+
|John  |orange|
|John  |blue  |
|Jack  |green |
|Jack  |purple|
+------+------+

Now use the distinct(person) and explode an array

val df2=spark.sql(""" 
with t1 (select distinct(person) person from df )
select person, posexplode(array('red','orange','yellow','green','blue','indigo','violet'))  (pos,color)  from t1
""")
df2.show(false)
df2.createOrReplaceTempView("df2")

+------+---+------+
|person|pos|color |
+------+---+------+
|Jack  |0  |red   |
|Jack  |1  |orange|
|Jack  |2  |yellow|
|Jack  |3  |green |
|Jack  |4  |blue  |
|Jack  |5  |indigo|
|Jack  |6  |violet|
|John  |0  |red   |
|John  |1  |orange|
|John  |2  |yellow|
|John  |3  |green |
|John  |4  |blue  |
|John  |5  |indigo|
|John  |6  |violet|
+------+---+------+

Do a join on person and color, assign row_number then filter on it.

spark.sql("""
with t1 ( select a.person,a.color, b.pos, row_number() over(partition by a.person order by pos) rw from df a join df2 b on a.person=b.person and a.color=b.color)
select person, color from t1 where rw=1
""").show

+------+------+
|person| color|
+------+------+
|  Jack| green|
|  John|orange|
+------+------+

Upvotes: 0

Gordon Linoff
Gordon Linoff

Reputation: 1269443

You can use row_number():

select t.*
from (select t.*,
             row_number() over (partition by person
                                order by case color when 'red' then 1 when 'orange' then 2 when 'yellow' then 3 when 'green' then 4 when 'blue' when 'indigo' then 5 when 'violet' then 6 else 7 end
                               ) as seqnum
      from t
     ) t
where seqnum = 1;

Upvotes: 3

Jacek Laskowski
Jacek Laskowski

Reputation: 74619

Would the following be of any help?

val colors = Seq("red","orange","yellow","green","blue","indigo","violet").zipWithIndex.toDF("name", "rank")
// +------+----+
// |  name|rank|
// +------+----+
// |   red|   0|
// |orange|   1|
// |yellow|   2|
// | green|   3|
// |  blue|   4|
// |indigo|   5|
// |violet|   6|
// +------+----+

val peopleWithColors = spark.read.option("header", true).csv("input.csv")
// +------+------+
// |PERSON| COLOR|
// +------+------+
// |  John|orange|
// |  John|  blue|
// |  Jack| green|
// |  Jack|purple|
// +------+------+

val peopleColorsRanked = peopleWithColors.join(colors).where('color === 'name)
// +------+------+------+----+
// |PERSON| COLOR|  name|rank|
// +------+------+------+----+
// |  John|orange|orange|   1|
// |  Jack| green| green|   3|
// |  John|  blue|  blue|   4|
// +------+------+------+----+

And the gist of the solution is to use Windowed Aggregation (over person column and sorted).

import org.apache.spark.sql.expressions.Window
val byPerson = Window.partitionBy('person).orderBy('rank)
val solution = peopleColorsRanked.withColumn("r", min('rank) over byPerson).where('rank === 'r)
// +------+------+------+----+---+
// |PERSON| COLOR|  name|rank|  r|
// +------+------+------+----+---+
// |  Jack| green| green|   3|  3|
// |  John|orange|orange|   1|  1|
// +------+------+------+----+---+

Upvotes: 0

Jon Armstrong
Jon Armstrong

Reputation: 4694

For an example of a dialect specific solution:

MySQL / MariaDB support: FIND_IN_SET and FIELD.

WITH data0 (person, color) AS (
         SELECT 'John', 'orange' UNION
         SELECT 'John', 'blue'   UNION
         SELECT 'Jack', 'green'  UNION
         SELECT 'Jack', 'purple'
     )
   , data1 AS (
         SELECT t.*
              , FIND_IN_SET(color, 'red,orange,yellow,green,blue,indigo,violet') AS seq
           FROM data0 AS t
     )
   , data AS (
         SELECT t.*
              , ROW_NUMBER() OVER (PARTITION BY person ORDER BY seq < 1, seq) AS n
           FROM data1 AS t
     )
SELECT person, color, seq, n
  FROM data
 WHERE n = 1
;

and

WITH data0 (person, color) AS (
         SELECT 'John', 'orange' UNION
         SELECT 'John', 'blue'   UNION
         SELECT 'Jack', 'green'  UNION
         SELECT 'Jack', 'purple'
     )
   , data1 AS (
         SELECT t.*
              , FIELD(color, 'red','orange','yellow','green','blue','indigo','violet') AS seq
           FROM data0 AS t
     )
   , data AS (
         SELECT t.*
              , ROW_NUMBER() OVER (PARTITION BY person ORDER BY seq < 1, seq) AS n
           FROM data1 AS t
     )
SELECT person, color, seq, n
  FROM data
 WHERE n = 1
;

Result:

+--------+--------+-----+---+
| person | color  | seq | n |
+--------+--------+-----+---+
| Jack   | green  |   4 | 1 |
| John   | orange |   2 | 1 |
+--------+--------+-----+---+

Working test case

Upvotes: 0

Related Questions