Reputation: 433
I have multiple columns in a table in hive having around 80 columns. I need to apply the distinct clause on some of the columns and get the first values from the other columns also. Below is the representation of what I am trying to achieve.
select distinct(col1,col2,col3),col5,col6,col7
from abc where col1 = 'something';
All the columns mentioned above are text columns. So I cannot apply group by and aggregate functions.
Upvotes: 4
Views: 4325
Reputation: 3428
You can use row_number
function to solve the problem.
create table temp as
select *, row_number() over (partition by col1,col2,col3) as rn
from abc
where col1 = 'something';
select *
from temp
where rn=1
You can also sort the table while partitioning.
row_number() over (partition by col1,col2,col3 order by col4 asc) as rn
Upvotes: 2
Reputation: 368
DISTINCT is the most overused and least understood function in SQL. It's the last thing that is executed over your entire result set and removes duplicates using ALL columns in your select. You can do a GROUP BY with a string, in fact that is the answer here:
SELECT col1,col2,col3,COLLECT_SET(col4),COLLECT_SET(col5),COLLECT_SET(col6)
FROM abc WHERE col1 = 'something'
GROUP BY col1,col2,col3;
Now that I re-read your question though, I'm not really sure what you are after. You might have to join the table to an aggregate of itself.
Upvotes: 0