Reputation: 335

How to avoid repeat occurrences in a column

I have a table which describes agent who book tickets for different customers. following data describes one customer data.

From the above data what i am expecting is

The out put meaning is, i want to group the queues first he booked some tickets to singapore, then austin, again singapore and delhi

How we can achieve this in SQL please help me on this

if out put is like as follows is also helpful

Upvotes: 1

Answers (4)

Mikhail Berlyant

Reputation: 172993

Below is for BigQuery Standard SQL

#standardSQL
SELECT agent_id, 
  STRING_AGG(DISTINCT travel_destination) AS travel_destination
FROM `project.dataset.table`
GROUP BY agent_id

it will produce following output

Row agent_id    travel_destination   
1   A1001       Singapore,Austin,Delhi

Looks like expected output is Singapore,Austin,Singapore,Delhi - below is yet another option for this

#standardSQL
CREATE TEMP FUNCTION DedupConsecutive(line STRING) RETURNS STRING LANGUAGE js AS """
  return line.split(",").filter(function(value,index,arr){return value != arr[index+1];}).join(",");
""";
SELECT agent_id, 
  DedupConsecutive(STRING_AGG(travel_destination ORDER BY date_of_booking)) destinations
FROM `project.dataset.table`
GROUP BY agent_id

Same sentiment as by Gordon - I cannot think of a simpler solution. :o)

Upvotes: 0

Gordon Linoff

Reputation: 1269923

I would just use lag():

SELECT t.agent_id, t.travel_dest
FROM (SELECT t.*,
             LAG(travel_dest) OVER (PARTITION BY agent_id ORDER BY bookdt) as prev_travel_dest
      FROM t
     ) t
WHERE prev_travel_dest IS NULL OR prev_travel_dest <> travel_dest
ORDER BY agent_id, bookdt;

I cannot think of a simpler solution.

Upvotes: 0

GMB

Reputation: 222482

This is a gaps and island problem. To solve it, you need to generate groups of adjacent records. This is usually done by comparing row numbers accross two different partitions.

Consider:

select 
    agent_id,   
    travel_destination,
    min(date_of_booking) first_date_of_booking,
    max(date_of_booking) max_date_of_booking
from (
    select 
        t.*,
        row_number() 
            over(partition by agent_id order by date_of_booking) rn1,
        row_number() 
            over(partition by agent_id, travel_destination order by date_of_booking) rn2
    from mytable t
) t 
group by 
    agent_id, 
    rn1 - rn2,
    travel_destination
order by first_date_of_booking

Note that I added the beginning and ending date of each group to the answer, because I find that it makes the answer more meaningful.

Another remark: based on your sample data, it is unclear whether you want to put the customerid in the group; I assumed not (if yes, you need to add that column to both partitions).

Demo on DB Fiddle:

Given this (simplified) dataset:

agent_id | travel_destination | customer_id | date_of_booking
:------- | :----------------- | :---------- | :--------------
A1001    | Singapore          | C1001       | 2019-06-10     
A1001    | Singapore          | C1001       | 2019-06-11     
A1001    | Austin             | C1001       | 2019-06-12     
A1001    | Singapore          | C1001       | 2019-06-13     
A1001    | Singapore          | C1001       | 2019-06-14     
A1001    | Dehli              | C1001       | 2019-06-15

The query returns:

agent_id | travel_destination | first_date_of_booking | max_date_of_booking
:------- | :----------------- | :-------------------- | :------------------
A1001    | Singapore          | 2019-06-10            | 2019-06-11         
A1001    | Austin             | 2019-06-12            | 2019-06-12         
A1001    | Singapore          | 2019-06-13            | 2019-06-14         
A1001    | Dehli              | 2019-06-15            | 2019-06-15

To achieve the second output that you demonstrated, you can add another level of aggregation and use string_agg()

select 
    agent_id,
    string_agg(travel_destination order by first_date_of_booking) travel_destination
from (
  -- above query
) t
group by agent_id

Upvotes: 1

marcothesane

Reputation: 6741

Try this - at least if your database has a function like LISTAGG, as in Vertica ...

WITH
-- this is your input - next time put it in so it can be 
-- copy-pasted and formatted to the below ....                                                                                                                                                    
input(agent_id,travel_dest,cust_id,bookdt) AS (
          SELECT 'A1001','Singapore','C1001',DATE '2109-06-10'
UNION ALL SELECT 'A1001','Singapore','C1001',DATE '2019-06-11'
UNION ALL SELECT 'A1001','Austin'   ,'C1001',DATE '2019-06-19'
UNION ALL SELECT 'A1001','Austin'   ,'C1001',DATE '2019-06-19'
UNION ALL SELECT 'A1001','Austin'   ,'C1001',DATE '2019-06-20'
UNION ALL SELECT 'A1001','Singapore','C1001',DATE '2019-07-30'
UNION ALL SELECT 'A1001','Singapore','C1001',DATE '2019-07-31'
UNION ALL SELECT 'A1001','Delhi'    ,'C1001',DATE '2019-08-01'
UNION ALL SELECT 'A1001','Delhi'    ,'C1001',DATE '2019-08-10'
UNION ALL SELECT 'A1001','Delhi'    ,'C1001',DATE '2019-08-10'
UNION ALL SELECT 'A1001','Delhi'    ,'C1001',DATE '2019-08-10'
UNION ALL SELECT 'A1001','Delhi'    ,'C1001',DATE '2019-08-10'
UNION ALL SELECT 'A1001','Delhi'    ,'C1001',DATE '2019-08-25'
)
-- real WITH clause starts here - substitute comma below with "WITH" ...
,
with_prev AS (
  SELECT
    agent_id
  , travel_dest
  , LAG(travel_dest,1,'') OVER (PARTITION BY agent_id ORDER BY bookdt) AS prev_dest
  FROM input
)
,
de_duped AS (
  SELECT
    agent_id
  , travel_dest
   FROM with_prev
   WHERE travel_dest <> prev_dest
)
SELECT
  agent_id
, LISTAGG(travel_dest) AS travel_dest
FROM de_duped
GROUP BY 1
;

You get:

 agent_id |                travel_dest                 
----------+--------------------------------------------
 A1001    | Singapore,Austin,Singapore,Delhi,Singapore

Upvotes: 0

How to avoid repeat occurrences in a column

Answers (4)

Related Questions