Arpit Trivedi
Arpit Trivedi

Reputation: 97

How to remove duplicates in hive string?

I have a comma-separated column(string) with duplicate values. I want to remove duplicates:
e.g.

column_name
-----------------
gun,gun,man,gun,man
shuttle,enemy,enemy,run
hit,chase

I want result like:

column_name
----------------
gun,man
shuttle,enemy,run
hit,chase

I am using hive database.

Upvotes: 2

Views: 3206

Answers (2)

David Cui
David Cui

Reputation: 31

If value sort is not a concern:

with mytable as (
  select 'gun,gun,man,gun,man' as column_name union 
  select 'shuttle,enemy,enemy,run' as column_name union
  select 'hit,chase' as column_name
) -- test data

SELECT column_name, concat_ws(',',collect_set(item)) from (
  select distinct column_name, s.item from mytable 
  lateral view explode(split(column_name,',')) s as item
) t 
group by column_name
;
+--------------------------+--------------------+--+
|       column_name        |        _c1         |
+--------------------------+--------------------+--+
| gun,gun,man,gun,man      | gun,man            |
| hit,chase                | chase,hit          |
| shuttle,enemy,enemy,run  | enemy,run,shuttle  |
+--------------------------+--------------------+--+

If want to keep the value sorted:

with mytable as (
  select 'gun,gun,man,gun,man' as column_name union 
  select 'shuttle,enemy,enemy,run' as column_name union
  select 'hit,chase' as column_name
) -- test data

select column_name,concat_ws(',',collect_set(item)) as column_name_distincted
from (
  select column_name,item, min(pos) as pos
  from (
    select column_name,pos,item
    from mytable 
    lateral view posexplode(split(column_name,',')) s as pos,item
  ) t
  group by column_name,item
  order by column_name,pos
) t 
group by column_name
;
+--------------------------+-------------------------+--+
|       column_name        | column_name_distincted  |
+--------------------------+-------------------------+--+
| gun,gun,man,gun,man      | gun,man                 |
| hit,chase                | hit,chase               |
| shuttle,enemy,enemy,run  | shuttle,enemy,run       |
+--------------------------+-------------------------+--+

Upvotes: 1

David דודו Markovitz
David דודו Markovitz

Reputation: 44991

Option 1: keep last occurrence

This will keep the last occurrence of every word.
E.g. 'hello,world,hello,world,hello' will result in 'world,hello'

select  regexp_replace
        (
            column_name
           ,'(?<=^|,)(?<word>.*?),(?=.*(?<=,)\\k<word>(?=,|$))'
           ,''
        )

from    mytable
;

+-------------------+
| gun,man           |
| shuttle,enemy,run |
| hit,chase         |
+-------------------+

Option 2: keep first occurrence

This will keep the first occurrence of every word.
E.g. 'hello,world,hello,world,hello' will result in 'hello,world'

select  reverse            
        (
            regexp_replace
            (
                reverse(column_name)
               ,'(?<=^|,)(?<word>.*?),(?=.*(?<=,)\\k<word>(?=,|$))'
               ,''
            )
        )

from    mytable
;

Option 3: sorted

E.g. 'Cherry,Apple,Cherry,Cherry,Cherry,Banana,Apple' will result in 'Apple,Banana,Cherry'

select  regexp_replace
        (
            concat_ws(',',sort_array(split(column_name,',')))
           ,'(?<=^|,)(?<word>.*?)(,\\k<word>(?=,|$))+'
           ,'${word}'
        )

from    mytable
;

Upvotes: 7

Related Questions