Reputation: 938
I have a table in Hive that has 20 columns and I want to count unique records and all records per hour.
Table looks like:
CREATE EXTERNAL TABLE test1(
log_date string,
advertiser_creatives_id string,
cookieID string,
)
STORED AS ORC
LOCATION "/day1orc"
tblproperties ("orc.compress"="ZLIB");
And my query like this:
SELECT Hour(log_date),
Count(DISTINCT cookieid) AS UNIQUE,
Count(1) AS impressions
FROM test1
GROUP BY Hour(log_date);
But the results are not correct. I have about 70 million entries and when I do a sum of impressions I only get like 8 million so I suspect the distinct takes too many columns in account.
So how can I fix this so that I get the correct amount of impressions?
** Extra information **
hive.vectorized.execution.enabled
is undefined so it is not active.
The same query in TEXT format returns even less rows (about 2.7 million)
Result of COUNT(*)
: 70643229
Result of COUNT(cookieID)
: 70643229
Result of COUNT(DISTINCT cookieID)
: 1440195
Cheers
Upvotes: 1
Views: 3070
Reputation: 546
I have an example,may be useful for you.I think you "row format delimited fields terminated by" has some problems . I have a text,seperate by "\t",like below:
id date value
1 01-01-2014 10
1 03-01-2014 05
1 07-01-2014 40
1 05-01-2014 20
2 05-01-2014 10
but I only create a table have 2 columns, like below:
use tmp ;
create table sw_test(id string,td string) row format delimited fields terminated by '\t' ;
LOAD DATA LOCAL INPATH '/home/hadoop/b.txt' INTO TABLE sw_test;
How do you think the result of "select td from sw_test ;"
NOT
td
01-01-2014 10
03-01-2014 05
07-01-2014 40
05-01-2014 20
05-01-2014 10
BUT
td
01-01-2014
03-01-2014
07-01-2014
05-01-2014
05-01-2014
So,I think you cookie contains some special column include your defined seperator. I hope this can help you . good luck!
Upvotes: 1