Reputation: 2738
I'm currently designing the structure of my database. I'm planning to have a huge table (let's call it the Clicks table) with hundreds of millions of rows. Many of its columns will be foreign keys referencing other tables, to reduce the size of this huge table and to reduce query time.
In those other 'reference tables' I'm planning to store most of the data about clicks. So when I query the Clicks table, I'll simply JOIN a few of these reference tables to get what I want to know about the clicks.
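For example, a rough sketch of what I have in mind (MySQL; names like referrers are placeholders, and a real click row would reference several such tables):

CREATE TABLE referrers (
    id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    string VARCHAR(50) NOT NULL
);

CREATE TABLE clicks (
    id BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    referrer_id INT UNSIGNED NOT NULL,
    clicked_at DATETIME NOT NULL,
    FOREIGN KEY (referrer_id) REFERENCES referrers (id)
);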
1st question: is this good practice in terms of speed, given that I'm going to run lots of SELECTs on this huge Clicks table later on?
These smaller reference tables will have a few thousand rows each, mostly with a single string column. The strings will be somewhere between 5 and 50 characters long.
What I'm planning to do is, when a click comes in, check these small tables to see whether the same value already exists, and insert it if it doesn't.
This would require a SELECT.
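Roughly this flow per click, reusing the sketch above ($string and $id stand for values supplied by my application):

SELECT id FROM referrers WHERE string = $string;
-- if no row came back, insert the new value and take its id:
INSERT INTO referrers (string) VALUES ($string);
-- then record the click against that id:
INSERT INTO clicks (referrer_id, clicked_at) VALUES ($id, NOW());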
2nd question: Is it better to index the string itself and search on it, or should I add another column holding the MD5 of the string and look that up (with an index) instead? In other words, does the string's length affect how long a lookup takes in a simple SELECT?
I'm planning to do SELECTs like this:
SELECT id FROM table1 WHERE string = $string
Is there any better way to achieve any of the above?
Upvotes: 0
Views: 115
Reputation: 211540
If you're hashing these then the hash itself will probably be longer than the strings you're hashing (an MD5 digest is 32 hex characters, versus your 5-50 character strings), making it counter-productive. You'll want to hash things that are consistently larger, often by an order of magnitude or more. A 7KB JSON string, for example, is a good candidate: computing the hash and looking that up in the index would be faster than comparing full strings in an index.
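To illustrate where hashing does pay off, a sketch (MySQL, with a hypothetical pages table) that indexes an MD5 of a large document instead of the document itself:

CREATE TABLE pages (
    id BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    payload JSON NOT NULL,           -- e.g. a ~7KB JSON document
    payload_md5 CHAR(32) NOT NULL,   -- kept in sync by the application on insert
    KEY idx_payload_md5 (payload_md5)
);

-- compare 32 fixed-width characters instead of the whole document
-- (a careful version would also compare payload to rule out hash collisions)
SELECT id FROM pages WHERE payload_md5 = MD5($json);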
What you need to do is prototype this, fill it up with a representative amount of data, and see how it performs. Your database will need to be tuned to handle your workload, and your schema will need to be exercised to the breaking point so you know how much data you can handle before your approach melts down.
Maybe that breaking point is 100 million records. Maybe it's 50 billion. Nobody knows how it will perform on your hardware, and only you can find out by testing.
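As one way to do that in MySQL 8, you could fill a schema shaped like the one in the question with throwaway data (row counts are arbitrary; this assumes referrer ids come out as 1-1000):

SET SESSION cte_max_recursion_depth = 1000000;

-- a thousand distinct reference strings
INSERT INTO referrers (string)
WITH RECURSIVE seq (n) AS (
    SELECT 1 UNION ALL SELECT n + 1 FROM seq WHERE n < 1000
)
SELECT CONCAT('ref-', n) FROM seq;

-- a million clicks pointing at random reference rows
INSERT INTO clicks (referrer_id, clicked_at)
WITH RECURSIVE seq (n) AS (
    SELECT 1 UNION ALL SELECT n + 1 FROM seq WHERE n < 1000000
)
SELECT 1 + FLOOR(RAND() * 1000),
       NOW() - INTERVAL FLOOR(RAND() * 2592000) SECOND
FROM seq;

Then run your real queries against it, EXPLAIN them, and keep scaling the row counts up until something hurts.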
Upvotes: 1
Reputation: 1269443
Your design sounds fine. You want a secondary index on the string in each of the reference tables.
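Something like this, assuming a reference table named referrers; making it UNIQUE also enforces the deduplication you're currently doing by hand:

CREATE UNIQUE INDEX idx_referrers_string ON referrers (string);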
It's unclear from your description whether you are doing this one "click" at a time or in batches.
I would recommend a batch approach for this operation, unless you have a compelling need for real-time data. If you do need real-time data, I would tend to advocate a "streaming-ish" approach, where new data is added by inserting into existing tables and never by updating them.
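A sketch of the batch variant (clicks_staging is a made-up staging table): append raw clicks cheaply during the day, then fold them in periodically:

-- during the day: append-only writes, no lookups on the hot path
INSERT INTO clicks_staging (raw_string, clicked_at) VALUES ($string, NOW());

-- at batch time: add any strings not seen before...
INSERT INTO referrers (string)
SELECT DISTINCT s.raw_string
FROM clicks_staging s
LEFT JOIN referrers r ON r.string = s.raw_string
WHERE r.id IS NULL;

-- ...then resolve ids, move the rows into the main table, and clear staging
INSERT INTO clicks (referrer_id, clicked_at)
SELECT r.id, s.clicked_at
FROM clicks_staging s
JOIN referrers r ON r.string = s.raw_string;

TRUNCATE TABLE clicks_staging;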
If you are updating millions of rows each day, individually, then the locking operations during peak processing could get costly. If the table is being used for analytics or reporting, the query loads from that processing could also interfere with the updates.
Upvotes: 1