palash choubey
palash choubey

Reputation: 33

hive counting same field many times

I need to count how many students are from which college but when i am using below query

select college ,COUNT(*) from students group by college ;

i am getting this result

enter image description here

result showing different counts for same colleges what should i do here so i can get the proper count of colleges

Upvotes: 1

Views: 33

Answers (1)

leftjoin
leftjoin

Reputation: 38335

It seems like you have many different names of the same college, like these

JIIT
"JIIT
jiit

Try to normalize them (convert to uppercase and remove '"'), so it will be the same JIIT after group by:

 select case when college = 'BSA' then 'BSA College of Technology'
        --add other cases
        else --rule for others
            trim(upper(regexp_replace(college,'"',''))) 
         end as college 
       ,COUNT(*)                                    as cnt 
   from students 
  group by 
        case when college = 'BSA' then 'BSA College of Technology'
        --add other cases
        else --rule for others
            trim(upper(regexp_replace(college,'"',''))) 
         end --the same sentence should be in group by, or use subquery instead
;

Apply case to convert more complex strings like MJP ROHILKHAND and M J P ROHILKHAND to the same.

And this happens because the database is not normalized and input is not restricted for College column by College dimension.

Upvotes: 1

Related Questions