eliminate duplicates from row SQL hive

I have this table

+------------+---------------------+--+
|  country   |      commodity      |
+------------+---------------------+--+
| Argentina  | Copper, molybdenum  |
| Argentina  | Silver, lead        |
| Argentina  | Copper, gold        |
| Argentina  | Copper, gold        |
| Argentina  | Copper              |
| Spain      | Rhodochrosite       |
| Spain      | Copper              |
| Spain      | Limestone           |
| Spain      | Gold                |
| Spain      | Limestone           |
+------------+---------------------+--+

And I want to display this

+------------+-----------------------------------------+--+
|  country   |                   minerals              |
+------------+-----------------------------------------+--+
| Argentina  | copper, molybdenum, silver, lead, gold  |
| Spain      | rhodochrosite, copper, limestone, gold  |
+------------+-----------------------------------------+--+

So I want do join all commodity for each country in one column "minerals" and eliminate duplicates, but in original column "commodity" there can be more then 1 minerals as you can see in first table and also there can be lower or upper case Gold, gold etc.

I tried

SELECT country, CONCAT_WS(', ' ,COLLECT_SET(LOWER(commodity))) as minerals 
FROM depositOPT 
GROUP BY country;

But it didn't eliminate duplicates because output looks like this

    +------------+------------------------------------------------------------------------
    |  country   |                   minerals                  
    +------------+------------------------------------------------------------------------
    | Argentina  | copper, molybdenum, silver, lead, copper, gold, copper, gold, copper  
    | Spain      | rhodochrosite, copper, limestone, gold, limestone    
    +------------+------------------------------------------------------------------------

Thanks for advice.

Upvotes: 1

Views: 990

Answers (1)

jose_bacoy
jose_bacoy

Reputation: 12684

I will split up the commodity columns into individual minerals to remove duplicates then group by country. Hope this helps. Thanks.

 select t.country, 
        collect_set(lower(t.minerals)) as minerals 
 from (select country, trim(minerals) as minerals 
       from depositOPT lateral view explode(split(commodity,',')) s as  minerals) t
group by t.country;

**Sample result:**
Country         Minerals
Argentina       ["copper","molybdenum","silver","lead","gold"]

Upvotes: 1

Related Questions