Reputation: 759
I need to know which database model is better for performance.
First Database Model
There are three tables: features, products, and feature_values.
The features table:
+----+-----------+
| id | name      |
+----+-----------+
| 1  | Brand     |
| 2  | Color     |
| 3  | Dimension |
| 4  | Model     |
+----+-----------+
And the feature_values table:
+----+---------+------------+
| id | name    | feature_id |
+----+---------+------------+
| 1  | Sony    | 1          |
| 2  | Samsung | 1          |
| 3  | Red     | 2          |
| 4  | Blue    | 2          |
| 5  | 20 "    | 3          |
| 6  | 30 "    | 3          |
| 7  | Model A | 4          |
| 8  | Model B | 4          |
+----+---------+------------+
And the products table:
+----+--------------------+----------+
| id | product_name       | features |
+----+--------------------+----------+
| 1  | Sony Television    | 1-3-5-7  |
| 2  | Samsung Television | 2-4-6-8  |
+----+--------------------+----------+
As you can see, with this structure, if a user wants to search for products by feature, I have to use REGEXP or a fulltext search in my query.
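For example, finding every product that is Red (feature value 3) would need something like this (a sketch, assuming MySQL):

SELECT id, product_name
FROM products
WHERE features REGEXP '(^|-)3(-|$)';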
Second Database Model
In the second model, I drop the features column from the products table and add a new table called product_features.
+----+--------------------+
| id | product_name       |
+----+--------------------+
| 1  | Sony Television    |
| 2  | Samsung Television |
+----+--------------------+
And the new product_features table:
+----+------------+------------+
| id | feature_id | product_id |
+----+------------+------------+
| 1  | 1          | 1          |
| 2  | 3          | 1          |
| 3  | 5          | 1          |
| 4  | 7          | 1          |
| 5  | 2          | 2          |
| 6  | 4          | 2          |
| 7  | 6          | 2          |
| 8  | 8          | 2          |
+----+------------+------------+
Now if a user wants to search for products by feature, I search product_features and then join to products.
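For the same example, finding every product that is Red (feature value 3) would be something like:

SELECT p.id, p.product_name
FROM product_features pf
JOIN products p ON p.id = pf.product_id
WHERE pf.feature_id = 3;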
Question
My question: if I use the second model with over 20 billion rows in the products table, and every product has at least 10 features, then the product_features table will have over 200 billion rows, and queries by feature may be slow.
If I use the first model, then every feature search means running a fulltext or REGEXP query against 20 billion rows.
I don't know which way is better. What do you suggest?
Upvotes: 1
Views: 1381
Reputation: 647
As Rockse said, you should stick with the 2nd model. As for large data sets: once you grow past what one database instance can handle, you need to start scaling "horizontally" (across several instances). One common way to scale really large data sets is "sharding": split your data set into subsets and store them on different database servers, then come up with an algorithm that tells your application which database to query for information about a given product.
For example, let's split the data set into 4 chunks of ~5 billion rows each, and use "product_id % 4" (that's modulo 4) as the key that tells you which database instance holds a particular product. Very rough pseudocode might look like this:
// One connection (or pool) per shard
let connections = [];

function initConnections() {
    // ... connect to 4 different databases or create connection pools ...
    connections = [conn1, conn2, conn3, conn4];
}

// Modulo on the product id picks the shard that owns this product
function getProductDbConnection(productId) {
    return connections[productId % 4];
}

function getProductFeatures(productId) {
    const conn = getProductDbConnection(productId);
    // ... run whatever queries you need to get the features ...
}
Here is an article that talks about how Instagram sharded their data to keep up with demand: http://instagram-engineering.tumblr.com/post/10853187575/sharding-ids-at-instagram
Upvotes: 1
Reputation: 4166
First Model
It is not even in 1NF, since the features attribute holds non-atomic values. Moreover, it would be really difficult to add, update, or delete features in the products table. So it is not feasible at all.
Second Model
It is normalized up to 5NF and looks good. To optimize searches, use a subquery and index product_id and feature_id; try to avoid JOINs on data this large.
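A minimal sketch of that approach (assuming MySQL; the index name is illustrative):

CREATE INDEX idx_feature_product ON product_features (feature_id, product_id);

SELECT id, product_name
FROM products
WHERE id IN (SELECT product_id
             FROM product_features
             WHERE feature_id = 3);

With (feature_id, product_id) as a composite index, the subquery can be answered from the index alone, without touching the product_features rows themselves.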
Upvotes: 1