Reputation: 61
I collect how people tag topics with categories in a table like:
ID | topic_id | votes_Category_1 | votes_Category_2 | ... | votes_Category_12
I dump this table every hour for history reasons. Let's say the table contains 2 million rows, dumped into history tables every hour.
This solution is not flexible if I want to add a column Category_13, so I'm thinking about this one:
ID | topic_id | Category_id | vote_count
This solution creates 12 rows per topic; it's better structured and more flexible, but then I'll have to dump 24 million rows every hour.
I need the best 10 topics in each category! I wonder whether, in case 2, using MAX on votes (WHERE category_id = x AND topic_id = y) will be slower than ORDER BY category_x WHERE topic_id = y in case 1.
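To make the two query shapes concrete, here is a minimal sketch of both schemas in an in-memory SQLite database via Python. Table and column names follow the post, but the 12 vote columns are abbreviated to 3 and the sample data is invented for illustration:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Case 1: one wide row per topic (12 category columns in the real table, 3 shown here)
cur.execute("""CREATE TABLE wide (
    id INTEGER PRIMARY KEY, topic_id INTEGER,
    votes_category_1 INTEGER, votes_category_2 INTEGER, votes_category_3 INTEGER)""")

# Case 2: one narrow row per (topic, category) pair
cur.execute("""CREATE TABLE narrow (
    id INTEGER PRIMARY KEY, topic_id INTEGER,
    category_id INTEGER, vote_count INTEGER)""")

rows = [(1, 10, 5, 7), (2, 3, 9, 1), (3, 8, 2, 4)]  # (topic_id, v1, v2, v3)
for topic_id, v1, v2, v3 in rows:
    cur.execute("INSERT INTO wide (topic_id, votes_category_1, votes_category_2,"
                " votes_category_3) VALUES (?, ?, ?, ?)", (topic_id, v1, v2, v3))
    for cat, v in enumerate((v1, v2, v3), start=1):
        cur.execute("INSERT INTO narrow (topic_id, category_id, vote_count)"
                    " VALUES (?, ?, ?)", (topic_id, cat, v))

# Case 1: top N topics for category 2 = ORDER BY that category's column
top_wide = cur.execute(
    "SELECT topic_id, votes_category_2 FROM wide"
    " ORDER BY votes_category_2 DESC LIMIT 2").fetchall()

# Case 2: top N topics for category 2 = filter on category_id, then ORDER BY vote_count
top_narrow = cur.execute(
    "SELECT topic_id, vote_count FROM narrow"
    " WHERE category_id = 2 ORDER BY vote_count DESC LIMIT 2").fetchall()

print(top_wide)    # [(2, 9), (1, 5)]
print(top_narrow)  # same ranking either way
```

Both queries return the same ranking; the performance question is about how much data each one has to touch, which the answer below analyzes.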
Which one would be better, JUST from a performance standpoint?
Thank you
Upvotes: 5
Views: 2369
Reputation: 24124
I would look at the retrieval patterns to decide on the approach.
If you retrieve the topics by category, then I would go with the second approach: define an index on the category field so that all the records for a given category are stored (relatively) contiguously on disk, resulting in fewer disk pages to retrieve. This also benefits from the smaller record size compared to the table with all categories as columns. The advantage is the flexibility to add more categories easily; the downside is the repetition of the (ID, topic_id) column data, which increases the total data size.
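A quick way to see this index at work is SQLite's EXPLAIN QUERY PLAN. The sketch below (index and table names are my own) puts vote_count into the index after category_id, so a "top 10 in category X" query can be answered by a short index scan without a separate sort:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("""CREATE TABLE votes (
    id INTEGER PRIMARY KEY, topic_id INTEGER,
    category_id INTEGER, vote_count INTEGER)""")

# Composite index: rows for one category sit together, already ordered by vote_count,
# so "top 10 in category X" becomes a short index range scan.
cur.execute("CREATE INDEX idx_cat_votes ON votes (category_id, vote_count)")

plan = cur.execute("""EXPLAIN QUERY PLAN
    SELECT topic_id, vote_count FROM votes
    WHERE category_id = ? ORDER BY vote_count DESC LIMIT 10""", (3,)).fetchall()

# The plan text varies by SQLite version, but it names idx_cat_votes and
# shows no temporary B-tree being built for the ORDER BY.
print(plan)
```

The same idea applies in other engines (MySQL, PostgreSQL): a composite index on (category, votes) turns the per-category top-N query into an index-ordered range read.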
If you retrieve by topic, then I would go with the first approach, defining an index on the topic. This avoids repeating the (ID, topic_id) column values for each category, reducing the total size of the data to be stored; since the row count is in the millions per hour, this reduction should be significant. The downside is the need to modify the schema for new categories.
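For the by-topic pattern, the wide schema answers with a single indexed row fetch. A minimal sketch (table and index names assumed, three of the twelve columns shown):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Wide schema: all category counts live in one row per topic
cur.execute("""CREATE TABLE topic_votes (
    id INTEGER PRIMARY KEY, topic_id INTEGER,
    votes_category_1 INTEGER, votes_category_2 INTEGER, votes_category_3 INTEGER)""")
cur.execute("CREATE INDEX idx_topic ON topic_votes (topic_id)")

cur.execute("INSERT INTO topic_votes (topic_id, votes_category_1,"
            " votes_category_2, votes_category_3) VALUES (42, 5, 9, 1)")

# One indexed lookup returns every category count for the topic at once;
# the narrow schema would need 12 rows to answer the same question.
row = cur.execute(
    "SELECT votes_category_1, votes_category_2, votes_category_3"
    " FROM topic_votes WHERE topic_id = ?", (42,)).fetchone()
print(row)  # (5, 9, 1)
```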
Edit: Considering the retrieval patterns from your edit:

"I retrieve top topics and their values per Category so I order by votes_Category_x in case 1."

I understand this as: find the top N topics with the largest number of votes in a given category.

"In case 2 I would look to find max(category) for each topic_id."

And this as: SELECT TopicID, MAX(votes) FROM TABLE GROUP BY TopicID, Category.
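Both interpreted queries can be run against a toy copy of the case-2 table (SQLite via Python; table and column names are my assumption):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE votes (topic_id INTEGER, category_id INTEGER, vote_count INTEGER)")
cur.executemany("INSERT INTO votes VALUES (?, ?, ?)",
                [(1, 1, 5), (1, 2, 7), (2, 1, 3), (2, 2, 9), (3, 2, 4)])

# "Top N topics with the largest number of votes in a given category" (N = 2, category 2)
top_n = cur.execute("""SELECT topic_id, vote_count FROM votes
                       WHERE category_id = 2
                       ORDER BY vote_count DESC LIMIT 2""").fetchall()

# The grouped form: max votes per topic per category. With one row per
# (topic, category) pair, each group contains exactly one row.
grouped = cur.execute("""SELECT topic_id, category_id, MAX(vote_count)
                         FROM votes GROUP BY topic_id, category_id""").fetchall()

print(top_n)    # [(2, 9), (1, 7)]
print(grouped)
```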
The record sizes differ between the 2 million and the 24 million rows, but yes, the ID and TopicID are repeated, which increases the data size by 8 bytes for each record. The first table stores 2 million records of 60 bytes each (15 4-byte ints), while the second stores 24 million records of 16 bytes each (4 4-byte ints). The second table would therefore add roughly 264 MB, i.e. about 64K pages of 4 KB each, per hour. That seems like a concern over a period of time. It will also increase fragmentation, since with the index organized by category in the second approach, new data is inserted into the middle of the index.
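The back-of-envelope arithmetic above can be checked directly (figures taken from the estimates in this answer):

```python
# Sizes per the estimates above: 4-byte ints, 4 KB pages
wide_bytes = 2_000_000 * 60      # 2M rows x 15 ints x 4 B
narrow_bytes = 24_000_000 * 16   # 24M rows x 4 ints x 4 B
extra_bytes = narrow_bytes - wide_bytes
extra_pages = extra_bytes // 4096

print(extra_bytes)  # 264000000 -> ~264 MB extra per hourly dump
print(extra_pages)  # 64453 pages of 4 KB
```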
It may be worth running some performance tests to understand this better, and also weighing how often you expect to add categories, before settling on one of the table structures.
Upvotes: 4