Abhishek
Abhishek

Reputation: 433

Select distinct on specific columns but select other columns also in hive

I have multiple columns in a table in hive having around 80 columns. I need to apply the distinct clause on some of the columns and get the first values from the other columns also. Below is the representation of what I am trying to achieve.

select distinct(col1,col2,col3),col5,col6,col7
from abc where col1 = 'something';

All the columns mentioned above are text columns. So I cannot apply group by and aggregate functions.

Upvotes: 4

Views: 4325

Answers (2)

Safwan
Safwan

Reputation: 3428

You can use row_number function to solve the problem.

create table temp as
select *, row_number() over (partition by col1,col2,col3) as rn
from abc 
where col1 = 'something';

select *
from temp
where rn=1

You can also sort the table while partitioning.

row_number() over (partition by col1,col2,col3 order by col4 asc) as rn

Upvotes: 2

deusxmach1na
deusxmach1na

Reputation: 368

DISTINCT is the most overused and least understood function in SQL. It's the last thing that is executed over your entire result set and removes duplicates using ALL columns in your select. You can do a GROUP BY with a string, in fact that is the answer here:

SELECT col1,col2,col3,COLLECT_SET(col4),COLLECT_SET(col5),COLLECT_SET(col6)
FROM abc WHERE col1 = 'something'
GROUP BY col1,col2,col3;

Now that I re-read your question though, I'm not really sure what you are after. You might have to join the table to an aggregate of itself.

Upvotes: 0

Related Questions