Suvarna Pattayil
Suvarna Pattayil

Reputation: 5239

Count and group by in Hive

I have a table in the hive as follows,

table1

Cola   | Colb  |  Colc |  Cold  |
---------------------------------
...etc
efo18   691 123 5692                                 
efo18   691 345 5657
...etc
fsx31   950 291 23456                                                         
fsx31   950 404 23456                                                          
fsx31   950 343 23456                                                         
fsx31   950 182 23456                                                         
fsx31   950 120 45042                                                         
fsx31   950 161 23456  
....etc
klz57   490 121 3330                                                          
klz57   490 113 3330                                                          
klz57   490 308 3330                                                          
klz57   490 411 3330                                                           
klz57   490 161 3330                                                          
klz57   386 108 3330                                                          
klz57   490 113 3330                                                          
klz57   490 125 3330                                                          
klz57   490 165 3330                                                          
klz57   490 166 3330  
...etc
---------------------------------

I wanted another table from table1 data with those having same value for Cold in a group and within that, those having the same Colb to have a sub-group and within that sub-group, those havin same values of Cola to belong to a group. In other words, each unique combination of Cola,Colb,Cold to be a row.And repeated rows be summed up.

insert into table table2(Col1 string,Col2 string,Col3 string,Count int) select cola,colb,cold,count(*) from table1 group by cold,colb,cola;

I expected this,

Col1   | Col2  |  Col3     |  Count  |
-------------------------------------
efo18    691     5692         1
efo18    691     5657         1
fsx31    950     23456        5   <-----1
fsx31    950     45042        1   <-----1
klz57    490     1234         9   <-----2
klz57    386     1234         1   <-----2
--------------------------------------

I got this,

table2

Col1   | Col2  |  Col3     |  Count  |
-------------------------------------
efo18    691     5692         1
efo18    691     5657         1
fsx31    950     23456        4   <-----1
fsx31    950     25456        1   <-----1
fsx31    950     45042        1   <-----1
klz57    490     1234         8   <-----2
klz57    386     1234         1   <-----2
klz57    490     1234         1   <-----2
--------------------------------------

What I don't understand is I am doing a group by on Cold then on Colb followed by Cola, then why does the Count for the rows marked (<----1),for values from Cola, is in different rows, even though everything falls in the same group? Colc is different for these two rows, but I haven't made use of it in grouping sp how are the two rows different?. Similarly for rows marked (<----2), what is the problem here.

Update:

Binary01, I was trying out the example you gave

hive> select * from xyz;
OK
x        y       z      zz
xxx     111     222     123 NULL    NULL    NULL
xxx     111     222     123 NULL    NULL    NULL
xxx     101     222     123 NULL    NULL    NULL
xux     111     422     123 NULL    NULL    NULL
xxx     111     522     323 NULL    NULL    NULL
xyx     111     622     123 NULL    NULL    NULL
xxx     115     322     123 NULL    NULL    NULL
xxx     111     122     123 NULL    NULL    NULL
xxx     111     223     123 NULL    NULL    NULL
xxy     111     212     143 NULL    NULL    NULL
xxx     117     222     123 NULL    NULL    NULL

What are those NULL values doing there? I have copy pasted your example line by line. Even creating the table as ,

create table xyz(x string ,y string, z string , zz string) 
row format delimited fields terminated by ',';

and the final query gives,

hive> select * from xyztemp;
OK
xux     111     422     123 NULL    NULL    1
xxx     101     222     123 NULL    NULL    1
xxx     111     122     123 NULL    NULL    1
xxx     111     222     123 NULL    NULL    2
xxx     111     223     123 NULL    NULL    1
xxx     111     522     323 NULL    NULL    1
xxx     115     322     123 NULL    NULL    1
xxx     117     222     123 NULL    NULL    1
xxy     111     212     143 NULL    NULL    1
xyx     111     622     123 NULL    NULL    1

Upvotes: 3

Views: 26796

Answers (2)

Binary01
Binary01

Reputation: 695

You must have missed something. I tried with the following data similar to your table and check the output its perfectly fine as you expected.

hive>set hive.cli.print.header=true;
hive> load data local inpath '/home/brdev/sudeep/testdata.txt' into table xyz;
hive> create table xyz(x string ,y string, z string , zz string) row format delimited fields terminated by ',';
hive> select * from xyz;
OK
x       y       z       zz
xxx     111     222     123
xxx     111     222     123
xxx     101     222     123
xux     111     422     123
xxx     111     522     323
xyx     111     622     123
xxx     115     322     123
xxx     111     122     123
xxx     111     223     123
xxy     111     212     143
xxx     117     222     123

hive>create table xyztemp ( aa string,bb string,cc string , dd int);
hive>insert into table xyztemp select x,y,zz,count(*) from xyz group by zz,y,x;
hive> select * from xyztemp;
OK
aa      bb      cc      dd
xxx     101     123     1
xux     111     123     1
xxx     111     123     4
xyx     111     123     1
xxx     115     123     1
xxx     117     123     1
xxy     111     143     1
xxx     111     323     1

I guess the above is the expected output you are looking for.

Upvotes: 4

Lukas Vermeer
Lukas Vermeer

Reputation: 5940

Not sure this is the problem here, but I've seen similar results in cases where the column values had leading or trailing spaces in some rows. Hive will not consider 'fsx31' and 'fsx31 ' (note the trailing space for the latter) to be the same value and will therefore give them individual rows when grouping.

What happens when you use TRIM to remove leading and trailing spaces, like so?

INSERT INTO table table2(Col1 STRING, Col2 STRING, Col3 STRING, Count INT) 
SELECT cola, colb, cold, count(*) 
FROM (
    SELECT
        TRIM(cola) AS cola, 
        TRIM(colb) AS colb, 
        TRIM(colb) AS colb 
    FROM table1
    )
GROUP BY cold, colb, cola;

(Using a subquery to trim and alias the columns is not strictly necessary, but I hope it helps to clarify what the intent is.)

Hope that helps.

Upvotes: 1

Related Questions