Reputation: 2624
My current employer has a huge table of items. Each item has user_id and, obviously, item_id properties. To improve performance and availability, my team decided to shard the table.
We are discussing two strategies:
Shard by item_id
In terms of high availability, if a shard is down then all users temporarily lose 1/N of their items. Performance will be even across all shards (random distribution).
Shard by user_id
If a shard is down then 1/N of the users won't be able to access their items. Performance might be uneven because we have users with thousands of items as well as users with just one item. There is also a big disadvantage: we would need to pass both item_id and user_id in order to access an item.
So my question is: which one should we choose? Maybe you can point me to a mathematical formula for deciding which one is better in different circumstances.
P.S. We already have replicas, but they are becoming useless for our write throughput.
UPDATE
We have SERP pages where we need to get items by ID, as well as pages like the user profile where users want to see their own items. The first pattern is used far more often than the second. We can easily give up ACID transactions because we've started to build microservices (so eventually almost all big entities will be encapsulated in a specific microservice).
Upvotes: 4
Views: 2138
Reputation: 7337
I see a couple of ways to attack this:
Tradeoffs
Sharding by item_id means you will not be able to do a single query for info about a single user_id. You will need to query every shard and then aggregate the results at the application level. I find the aggregation has a lot more pitfalls than you'd think; it is better to keep this in the database.
If you shard by user_id and you want to rebalance your shards by moving a user to another shard, you will need to atomically move all of the user's rows at once. This can be difficult if there are lots of rows. If you shard by item_id, you can move one item at a time. This allows you to incrementally rebalance your shards, which is awesome.
The key tradeoff seems to be the ability to query all of a particular user's data (sharding by user_id) vs. easier balancing and rebalancing of data across shards (sharding by item_id).
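As a rough illustration of the "query every shard and aggregate" cost when sharding by item_id, the sketch below fans a per-user query out to every shard and merges the results in the application. The in-memory dictionaries and the function names are hypothetical stand-ins for real database connections and queries.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory stand-ins for shards: each maps item_id -> user_id.
SHARDS = [
    {0: 7, 4: 9},
    {1: 7, 5: 8},
    {2: 9},
    {3: 7},
]


def items_on_shard(shard: dict, user_id: int) -> list:
    """Stand-in for a per-shard `SELECT item_id ... WHERE user_id = ?`."""
    return [item_id for item_id, owner in shard.items() if owner == user_id]


def items_for_user(user_id: int) -> list:
    """Scatter the query to every shard, then merge and sort in the application."""
    with ThreadPoolExecutor(max_workers=len(SHARDS)) as pool:
        partials = pool.map(lambda shard: items_on_shard(shard, user_id), SHARDS)
        merged = [item_id for partial in partials for item_id in partial]
    return sorted(merged)


print(items_for_user(7))  # [0, 1, 3]
```

Even in this toy form, every per-user read touches all shards, and anything like ordering, paging, or handling a shard that is down has to be done in application code.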
I would focus on the question of how you need to store and access your data. If you truly only need access by item_id, then shard by item_id. Avoid splitting your database in ways counterproductive to how you query it.
If you're still unsure, note that you can shard by item_id and then choose to shard by user_id later (you would do this by rebalancing based on user_id and then enforcing that new rows only get written to the shard their user_id belongs to).
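A minimal sketch of that later switch, assuming modulo routing and in-memory dictionaries in place of real shards (both are illustrative assumptions, as are all function names): rows are moved one at a time to the shard their user_id maps to, and new rows are routed by user_id from then on.

```python
NUM_SHARDS = 3  # assumed shard count, for illustration only


def user_shard(user_id: int) -> int:
    """Target placement once the table is sharded by user_id."""
    return user_id % NUM_SHARDS


def migrate_to_user_sharding(shards: list) -> int:
    """Move each row, one at a time, to the shard its user_id maps to."""
    moved = 0
    for source_index, shard in enumerate(shards):
        # Iterate over a snapshot because the dict is mutated during the loop.
        for item_id, user_id in list(shard.items()):
            target_index = user_shard(user_id)
            if target_index != source_index:
                shards[target_index][item_id] = user_id  # write to the new home first
                del shard[item_id]                       # then remove the old copy
                moved += 1
    return moved


def insert_row(shards: list, item_id: int, user_id: int) -> None:
    """After the switch, new rows go only to the shard of their user_id."""
    shards[user_shard(user_id)][item_id] = user_id


# Hypothetical starting point: rows placed by item_id % NUM_SHARDS.
shards = [{0: 1, 3: 0}, {1: 1, 4: 2}, {2: 0, 5: 2}]
moved = migrate_to_user_sharding(shards)
insert_row(shards, 6, 1)
```

A real migration would also need reads to check both the old and new location while rows are in flight; that part is omitted here.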
Based on your update, it sounds like your primary concerns are not relational queries, but rather scaling writes to this particular pool of data. If that's the case, sharding by item_id allows you the most flexibility to rebalance your data over time, and is less likely to develop hot spots or become unbalanced in the first place. This comes at the price of having to aggregate queries based on user_id across shards, but as long as those "all items for a given user" queries do not need consistency guarantees, you should be fine.
Upvotes: 3
Reputation: 3708
I'm afraid there is no formula that can calculate the answer for all cases. It depends on your data schema and on your system's functional requirements.
If a single item_id is meaningful on its own in your system and your users usually work with data from individual item_ids (as in an Instagram-like service where item_ids correspond to user photos), I would suggest sharding by item_id, because this choice has a lot of advantages from a technical point of view:
you do not need user_id to access item_id
There are also some disadvantages to this approach. For example, it will be more difficult to back up all items of a given user.
When only a complete series of item_ids is meaningful, it is more reasonable to shard by user_id.
Upvotes: 2