jOasis
jOasis

Reputation: 394

Display hive table with max value per group

The given hive query.

select field_name, sum(frequency) as total_sum, ontology_uri 
from feed_one.dd_instance 
where feed_name="mrna"
group by ontology_uri,field_name order by total_sum desc;

Generates the following table:

**field_name**          **total_sum**   **ontology_uri**
tcga_participant_barcode    1944    http://data.bioontology.org/ontologies/NCIT
gene                        1251    http://data.bioontology.org/ontologies/GEXO
protocol                    612     http://data.bioontology.org/ontologies/PR
gene                        378     http://data.bioontology.org/ontologies/OGG
geneid                      315     http://data.bioontology.org/ontologies/NCIT
gene                        315     http://data.bioontology.org/ontologies/RETO
cohort                      270     http://data.bioontology.org/ontologies/NCIT

I would like to use its result to display maximum total_sum per field_name. That is, following are my expectation:

**field_name**          **max(total_sum)**  **ontology_uri**
tcga_participant_barcode    1944    http://data.bioontology.org/ontologies/NCIT
gene                        1251    http://data.bioontology.org/ontologies/GEXO
protocol                    612     http://data.bioontology.org/ontologies/PR
geneid                      315     http://data.bioontology.org/ontologies/NCIT
cohort                      270     http://data.bioontology.org/ontologies/NCIT

I tried the following HQL, but led me to org.apache.hive.service.cli.HiveSQLException: Error while compiling statement: FAILED: ParseException line 3:1 mismatched input 'select' expecting ( near '(' in expression specification

select * from A
where (A.field_name, A.total_sum) IN
(select A.field_name, max(A.total_sum) as maximum
from 
(
select field_name, sum(frequency) as total_sum, ontology_uri 
from feed_one.dd_instance 
where feed_name="mrna"
group by ontology_uri,field_name order by total_sum desc
) as A
group by A.field_name);

Upvotes: 2

Views: 1808

Answers (1)

nobody
nobody

Reputation: 11090

select t1.field_name, max(t1.total_sum),t1.ontology_uri
from
(
   select field_name, sum(frequency) as total_sum, ontology_uri 
   from feed_one.dd_instance 
   where feed_name="mrna"
   group by ontology_uri,field_name
) t1
group by t1.field_name, t1.ontology_uri

Using row_number() and partition

select * from 
(
   select 
        field_name,
        ontology_uri,
        sum(frequency) over (partition by field_name,ontology_uri) as total_sum,
        row_number() over (partition by field_name,ontology_uri order by total_sum desc) as row_no
   from feed_one.dd_instance 
   where feed_name="mrna"
) res 
where res.row_no = 1

Upvotes: 2

Related Questions