Reputation: 1534
We have a database table which stores browser data for visitors, broken down by several subtypes. For simplicity, let's use the table schema below. The querying will basically be on any single id column, the metric column, the timestamp column (stored as seconds since epoch), and one of the device, browser, or os columns.
We are going to performance-test the star schema against the snowflake schema (where all of the ids go into a single column, and an additional column id_type is added to indicate which type of identifier it is) for this table. As long as the star schema (which is how it is now) is within 80% of the snowflake performance, we will keep it, since it makes our load process much easier. Before I do that, however, I want to make sure the indexes are optimized on the star schema.
create table browser_data (
id_1 int,
id_2 int,
id_3 int,
id_4 int,
metric varchar(20),
browser varchar(20),
device varchar(20),
os varchar(20),
timestamp bigint
)
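(For reference, the single-id-column variant described above might look roughly like this; the id_type column comes from the description, while the table name and column order are assumptions:)
create table browser_data_snowflake (
id int,
id_type int,
metric varchar(20),
browser varchar(20),
device varchar(20),
os varchar(20),
timestamp bigint
)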
Would it be better to create individual indexes on just the id columns, or to also include the metric and timestamp columns in those indexes?
Upvotes: 0
Views: 112
Reputation: 142208
Do not normalize "continuous" values, such as DATETIME, FLOAT, or INT. Do leave such values in the main table.
When you move a value to another table (or tables), especially in a "snowflake" design, it makes querying on that value somewhere between a little slower and a lot slower. This is especially true when you need to filter on more than one metric that is not in the main table. Either of these performs very poorly because of "snowflake" or "over-normalization":
WHERE a.x = 123 AND b.y = 345
ORDER BY a.x, b.y
As for what indexes to create -- that depends entirely on the queries you need to perform. So I strongly recommend you sketch out the likely SELECTs based on your tentative CREATE TABLEs.
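For instance, if one likely query against the question's browser_data table filters on a single id, a metric, a device, and a time range (a guess at the workload; the literal values are placeholders), then a composite index with the equality columns first and the range column last would serve it:
-- Hypothetical query shape, based on the question's description:
SELECT ...
FROM browser_data
WHERE id_1 = 42
AND metric = 'pageviews'
AND device = 'mobile'
AND timestamp BETWEEN 1700000000 AND 1700086400;
-- One matching index: equality columns (in any order) first, the range column last:
CREATE INDEX bd_id1_metric_device_ts ON browser_data (id_1, metric, device, timestamp);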
INT is 4 bytes. TIMESTAMP is 4 (DATETIME is 5), FLOAT is 4, etc. That is, normalizing such values is also inefficient on space: the surrogate INT id that replaces the value is just as big as the value itself, so nothing is saved.
More
When doing JOINs, the Optimizer will almost always start with one table, then move on to another table, etc. (See "Nested Loop Join".)
For example (building on the code above), when 2 columns are normalized away and you are testing on the values, you do not have the two ids in hand; you have only the two values. This makes the query execution very inefficient. For
SELECT ...
FROM main
JOIN a USING(a_id)
JOIN b USING(b_id)
WHERE a.x = 123 AND b.y = 345
The following is very likely to be the 'execution plan':
1. Scan a to find the row(s) with x=123; get the id(s) for those rows. This may include many rows that are yet to be filtered by b.y. a needs INDEX(x).
2. Reach into the main table, looking up rows with those id(s). main needs INDEX(a_id). Again, more rows than necessary may be hauled around.
3. Reach into b (using b_id) to check for y=345; toss the unnecessary rows you have been hauling around. b needs INDEX(b_id).
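Spelled out as DDL (index names are illustrative), the plan above needs:
CREATE INDEX a_x    ON a (x);        -- step 1: find rows in a with x=123
CREATE INDEX main_a ON main (a_id);  -- step 2: look up main rows by a_id
CREATE INDEX b_bid  ON b (b_id);     -- step 3: probe b by b_id to test y=345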
Note my comment about "haul around". Blindly using * (in SELECT *) adds to the problem -- all the columns are being hauled around while performing the steps.
On the other hand... If x and y were in the main table, then the code works like:
WHERE main.x = 123
AND main.y = 345
This only needs INDEX(x,y) (in either order), and it quickly locates exactly the rows desired.
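As DDL (again with an illustrative name), that is simply:
CREATE INDEX main_x_y ON main (x, y);  -- (y, x) would work equally well for this WHERE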
In the case of ORDER BY a.x, b.y, it cannot use any index on any table. So the query must create a tmp table, sort it, then deliver the rows in the desired order.
But if x and y are in the same table, then INDEX(x,y) (in that order) may be useful for ORDER BY x,y and avoid the tmp table and the sort.
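A minimal sketch of that case (names assumed as above):
-- With INDEX(x, y) on main, rows can be read straight off the index
-- in (x, y) order, so no tmp table and no sort is needed:
SELECT ...
FROM main
ORDER BY x, y;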
With a single table, the Optimizer might use an index for the WHERE, or it might use an index for the ORDER BY, depending on the phase of the moon. In some cases, one index can be used for both -- this is optimal.
Another note: if you also have LIMIT 10, then, when the sort is avoided, only 10 rows need to be looked at, not the entire set from the WHERE.
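Putting the last two notes together (a sketch; names and values are illustrative): with INDEX(x, y), an equality test on x lets the same index satisfy both the WHERE and the ORDER BY, and the LIMIT stops the scan after 10 rows:
SELECT ...
FROM main
WHERE main.x = 123  -- equality on the first column of INDEX(x, y)
ORDER BY main.y     -- remaining index column supplies the order: no sort
LIMIT 10;           -- scan stops after 10 index entries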
Upvotes: 1