Reputation: 324
I have around 34 million rows, each with 23 columns in the store_sales
table in the tpc-ds dataset .
I have a composite primary key
of columns ss_item_sk
and ss_ticket_number
.
Upon running the query SELECT count(DISTINCT <primary key>) ..
I'm able to see that it outputs the total number of rows present in the table.
Now I add another column along with the primary key
, namely ss_sold_date_sk
.
After this, if I run the count
query, I get fewer number of rows being printed than before. Can someone explain to me, via example, why this could happen?
TL;DR
When does adding a column to a composite primary key ever stop making it unique?
Upvotes: 1
Views: 185
Reputation: 44951
create table mytable (c1 string,c2 string);
insert into mytable values ('A','A'),('B',null),('C',null);
select count(distinct c1) as count_distinct from mytable;
+----------------+
| count_distinct |
+----------------+
| 3 |
+----------------+
As Expected - 3 distinct values - 'A','B' and 'C'
select count(distinct concat(c1,c2)) as count_distinct from mytable;
+----------------+
| count_distinct |
+----------------+
| 1 |
+----------------+
As expected. Why? - see next query
select c1,c2,concat(c1,c2) as concat_c1_c2 from mytable;
+----+------+--------------+
| c1 | c2 | concat_c1_c2 |
+----+------+--------------+
| A | A | AA |
| B | NULL | NULL |
| C | NULL | NULL |
+----+------+--------------+
Concatenation with NULL yields NULL
select count(distinct c1,c2) as count_distinct from mytable;
+----------------+
| count_distinct |
+----------------+
| 1 |
+----------------+
BUG!!
Here is a work-around the bug:
select count(distinct struct(c1,c2)) as count_distinct from mytable;
+----------------+
| count_distinct |
+----------------+
| 3 |
+----------------+
Upvotes: 2