darkownage
darkownage

Reputation: 938

hive count and count distinct not correct

I have a table in Hive that has 20 columns and I want to count unique records and all records per hour.

Table looks like:

CREATE EXTERNAL TABLE test1(
  log_date string, 
  advertiser_creatives_id string, 
  cookieID string, 
) 
STORED AS ORC
LOCATION "/day1orc"
tblproperties ("orc.compress"="ZLIB");

And my query like this:

SELECT Hour(log_date), 
       Count(DISTINCT cookieid) AS UNIQUE, 
       Count(1)                 AS impressions 
FROM   test1 
GROUP  BY Hour(log_date); 

But the results are not correct. I have about 70 million entries and when I do a sum of impressions I only get like 8 million so I suspect the distinct takes too many columns in account.

So how can I fix this so that I get the correct amount of impressions?

** Extra information **

hive.vectorized.execution.enabled is undefined so it is not active.

The same query in TEXT format returns even less rows (about 2.7 million)

Result of COUNT(*): 70643229

Result of COUNT(cookieID): 70643229

Result of COUNT(DISTINCT cookieID): 1440195

Cheers

Upvotes: 1

Views: 3070

Answers (1)

anyoneking
anyoneking

Reputation: 546

I have an example,may be useful for you.I think you "row format delimited fields terminated by" has some problems . I have a text,seperate by "\t",like below:

id    date      value
1   01-01-2014  10
1   03-01-2014  05
1   07-01-2014  40
1   05-01-2014  20
2   05-01-2014  10

but I only create a table have 2 columns, like below:

use tmp ;
create table sw_test(id string,td string) row format delimited fields terminated by '\t' ;
LOAD DATA LOCAL INPATH '/home/hadoop/b.txt' INTO TABLE sw_test;

How do you think the result of "select td from sw_test ;"

NOT

td
01-01-2014  10
03-01-2014  05
07-01-2014  40
05-01-2014  20
05-01-2014  10

BUT

td
01-01-2014
03-01-2014
07-01-2014
05-01-2014
05-01-2014

So,I think you cookie contains some special column include your defined seperator. I hope this can help you . good luck!

Upvotes: 1

Related Questions