Reputation: 81
Hopefully a simple question:
Would it be better to create 1) a single "document" table with tens of thousands of entries or 2) break them out into several "document_type" tables?
For example, 1) a "document" table with user_id, document_type and document_name columns or 2) separate "document_type" tables with user_id and document_name columns.
In either case, we are dealing with tens of thousands of entries.
My instincts tell me option 1 may possibly incur a significant performance hit vs option 2.
Thanks!
Upvotes: 1
Views: 182
Reputation: 12843
The only reason I see to split the table is if it should be modelled as subclasses of document.
That is, instead of:
document
- document_id (pk)
- type
- name
- attribute_x
- attribute_y
- attribute_z
- attribute_a
- attribute_b
- attribute_c
- attribute_1
- attribute_2
- attribute_3
You create a table for each subclass of document:
document
- document_id (pk)
- type
- name
document_type_1
- document_id (pk) references document(document_id)
- attribute_x
- attribute_y
- attribute_z
document_type_2
- document_id (pk) references document(document_id)
- attribute_a
- attribute_b
- attribute_c
document_type_3
- document_id (pk) references document(document_id)
- attribute_1
- attribute_2
- attribute_3
The only class of queries that becomes worse are of the kind "search all attributes of all documents". Almost all other usages become faster on average with the one-table-per-subclass because of the smaller row length (on average more rows will fit in the cache, and each disk read returns more rows).
Upvotes: 0
Reputation: 3055
Unless you expect this to grow to millions and millions of records and/or be very insert-heavy, there is no reason to split this up into more than one table, under any circumstance. The purpose of indexing in a database is to work around the large dataset problem.
In your case, suppose you have 90K entries with 30K of each of three types. If you index the document_type column, a query selecting one of the three types would be very nearly as fast as doing a select on a table that contained 30K entries of only the same type.
Further, because document ID will most likely be a numeric index with a high cardinality, assuming you index the column -- which you should, it should be the primary key -- selecting a record of a particular index will be just as fast on a table with 90K entries of three types as it would on a table of 30K entries of one type.
There are other reasons to shard data but they have to do with running complex queries, transactional inserts, table joins, and so on. In my experience, table designers often feel the need to shard things that shouldn't be sharded, which (as other answers have mentioned) leads to complexities that are unnecessary. The number one rule of development: keep it simple!
Upvotes: 1
Reputation: 8067
Tens of thousands entries is not that much in the world of relational databases provided the database is designed and indexed properly. Some of the things to consider, if you create several tables:
It will be harder to maintain the code
The performance of selects will suffer
Edit: improved formatting
Upvotes: 1
Reputation: 14457
Your performance hit for the first option shouldn't be too bad with proper indexing. It sounds like you will want to index document_name, then possibly one of the other fields. It depends somewhat on how much you will be inserting versus querying also; if inserts will be rare, you can afford more indexing.
Upvotes: 0