kirugan

Reputation: 2624

Sharding database by user_id vs by entity_id

My current employer has a huge table of items. Each item has a user_id and, obviously, an item_id. To improve performance and availability, my team decided to shard the table. We are discussing two strategies:

Shard by item_id

In terms of high availability, if a shard is down, every user temporarily loses access to about 1/N of their items. Performance will be even across all shards (random distribution).

Shard by user_id

If a shard is down, 1 in N users won't be able to access any of their items. Performance might be uneven, because we have users with thousands of items as well as users with just one. There is also a big disadvantage: we now need to pass both item_id and user_id in order to access an item.
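To make the availability comparison concrete, here is a minimal sketch. It assumes a simple hash-modulo routing scheme, a single failed shard, and items spread uniformly at random across shards; the function names are made up for illustration and are not from any library.

```python
import hashlib


def shard_for(key: int, num_shards: int) -> int:
    """Map an integer key to a shard with a stable hash.

    The same function works for either strategy: pass item_id to shard
    by item, or user_id to shard by user.
    """
    digest = hashlib.sha256(str(key).encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def p_affected_item_sharding(items_per_user: int, num_shards: int) -> float:
    """Chance that a user has at least one item on the one failed shard.

    Assumes each item lands on a uniformly random shard, so the chance
    that all k items avoid the failed shard is (1 - 1/N) ** k.
    """
    return 1.0 - (1.0 - 1.0 / num_shards) ** items_per_user


def p_affected_user_sharding(num_shards: int) -> float:
    """Chance that a user's single home shard is the failed one."""
    return 1.0 / num_shards


if __name__ == "__main__":
    n = 8
    for k in (1, 10, 1000):
        print(k, p_affected_item_sharding(k, n), p_affected_user_sharding(n))
```

Under these assumptions, sharding by item_id means a failure touches almost every heavy user but hides only about 1/N of each user's items, while sharding by user_id means a failure touches 1/N of users but hides everything they own.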

So my question is: which one should we choose? Maybe you can point me to some mathematical formula for deciding which one is better in different circumstances.

P.S. We already have replicas, but they do not help with our write throughput.

UPDATE

We have SERP pages where we need to get items by ID, as well as pages like the user profile where a user wants to see his/her own items. The first pattern is by far the more frequent of the two. We can easily give up ACID transactions because we've started building microservices (so eventually almost all big entities will be encapsulated in their own microservice).

Upvotes: 4

Views: 2138

Answers (2)

Eric M. Johnson

Reputation: 7337

I see a couple of ways to attack this:

  • How do you intend to shard? Separate master servers, or separate schemas serviced by the same server but backed by different storage backends?
  • How do you access this data? Is it basically key/value? Do you need to query all of a user's items at once? How transactional do your CRUD operations need to be?
  • Do you foresee unbalanced shards being a problem, based on the data you're storing?
  • Do you need to do relational queries of this data against other data in your system?

Tradeoffs

  • If you split shards across server/database instance boundaries, sharding by item_id means you will not be able to do a single query for info about a single user_id... you will need to query every shard and then aggregate the results at the application level. I find the aggregation has a lot more pitfalls than you'd think... better to keep this in the database.
  • If you can use a single database instance, sharding by creating tables/schemas that are backed by different storage subsystems would allow you to scale writes while still being able to do relational queries across them. All of your eggs are still in one server basket with this method, though.
  • If you shard by user_id, and you want to rebalance your shards by moving a user to another shard, you will need to atomically move all of the user's rows at once. This can be difficult if there are lots of rows. If you shard by item_id, you can move one item at a time. This allows you to incrementally rebalance your shards, which is awesome.
  • If you intend to split these into separate servers such that you cannot do relational queries across schemas, it might be better to use a key/value store such as DynamoDB. Then you only have to worry about one endpoint, and the sharding is done at the database layer. No middleware to determine which shard to use!

The key tradeoff seems to be the ability to query about all of a particular user's data (sharding by user_id), vs easier balancing and rebalancing of data across shards (sharding by item_id).
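As a rough illustration of what "query every shard and aggregate at the application level" involves when sharding by item_id, here is a minimal scatter-gather sketch. The shards are plain in-memory dicts standing in for separate databases, and `items_for_user` is a hypothetical helper, not an API of any real system.

```python
from concurrent.futures import ThreadPoolExecutor


def items_for_user(shards, user_id):
    """Collect all items owned by user_id from every shard.

    `shards` is a list of dicts mapping item_id -> row, where each row
    is a dict with at least a "user_id" key (an assumed layout).
    """

    def query_shard(shard):
        # Stand-in for "SELECT ... WHERE user_id = ?" against one shard.
        return [(item_id, row) for item_id, row in shard.items()
                if row["user_id"] == user_id]

    # Fan out to all shards in parallel, then merge the partial results.
    with ThreadPoolExecutor(max_workers=max(1, len(shards))) as pool:
        partials = list(pool.map(query_shard, shards))

    merged = [pair for part in partials for pair in part]
    merged.sort(key=lambda pair: pair[0])  # global ordering must be redone here
    return merged
```

Even in this toy version, ordering has to be reapplied after the merge. In a real deployment, pagination, timeouts, and partial failures of individual shards would also have to be handled in the application, which is where the pitfalls tend to show up.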

I would focus on the question of how you need to store and access your data. If you truly only need access by item_id, then shard by item_id. Avoid splitting your database in ways counterproductive to how you query it.

If you're still unsure, note that you can shard by item_id and then choose to shard by user_id later (you would do this by rebalancing based on user_id and then enforcing new rows only getting written to the shard their user_id belongs to).
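One way this kind of later migration could look is sketched below. It assumes a lookup table (directory) that records which shard holds each item, which is an assumption of this example rather than something the setup above requires; `ShardDirectory` and `migrate_to_user_sharding` are hypothetical names.

```python
class ShardDirectory:
    """Toy directory-based sharding: a lookup table says where each item lives."""

    def __init__(self, num_shards):
        self.num_shards = num_shards
        self.shards = [dict() for _ in range(num_shards)]  # item_id -> row
        self.location = {}                                 # item_id -> shard index

    def put(self, item_id, row, shard_index):
        self.shards[shard_index][item_id] = row
        self.location[item_id] = shard_index

    def get(self, item_id):
        return self.shards[self.location[item_id]][item_id]

    def move_item(self, item_id, target):
        """Move one item: copy, repoint the directory, then drop the old copy."""
        source = self.location[item_id]
        if source == target:
            return
        self.shards[target][item_id] = self.shards[source][item_id]
        self.location[item_id] = target
        del self.shards[source][item_id]


def migrate_to_user_sharding(directory):
    """Move items one at a time to the shard chosen by their user_id.

    Returns the number of items that actually moved.
    """
    moved = 0
    for item_id in list(directory.location):
        target = directory.get(item_id)["user_id"] % directory.num_shards
        if directory.location[item_id] != target:
            directory.move_item(item_id, target)
            moved += 1
    return moved
```

Because each move touches a single item, the migration can be paused and resumed, and reads by item_id keep working throughout. The cost is the extra directory lookup on every access, and a real system would also need the copy and pointer flip to be safe under concurrent writes.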

Based on your update, it sounds like your primary concerns are not relational queries, but rather scaling writes to this particular pool of data. If that's the case, sharding by item_id allows you the most flexibility to rebalance your data over time, and is less likely to develop hot spots or become unbalanced in the first place. This comes at the price of having to aggregate queries based on user_id across shards, but as long as those "all items for a given user" queries do not need consistency guarantees, you should be fine.

Upvotes: 3

SergeyLebedev

Reputation: 3708

I'm afraid there is no formula that can calculate the answer for all cases. It depends on your data schema and on your system's functional requirements.

If an individual item is meaningful on its own in your system, and your users usually work with individual items (as in an Instagram-like service, where each item_id corresponds to a user photo), I would suggest sharding by item_id, because this choice has a lot of advantages from the technical point of view:

  • ensures even load across all shards
  • ensures graceful degradation of your service: when a shard is down, users lose access to 1/N of their items, but they can still work with the others
  • you do not have to pass user_id to access item_id
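The even-load point can be checked with a small, deterministic experiment. This is only a sketch with made-up data: one heavy user with 1000 items, 99 users with one item each, sequential item IDs, and plain modulo routing instead of a real hash.

```python
def shard_sizes(items, key, num_shards):
    """Count how many items land on each shard when routing by `key`.

    `items` is a list of dicts with "item_id" and "user_id"; `key` is
    the name of the field used as the shard key.
    """
    sizes = [0] * num_shards
    for item in items:
        sizes[item[key] % num_shards] += 1
    return sizes


def make_skewed_items(heavy_items=1000, light_users=99):
    """Build a skewed dataset: user 0 owns many items, everyone else owns one."""
    items = []
    next_id = 0
    for _ in range(heavy_items):
        items.append({"item_id": next_id, "user_id": 0})
        next_id += 1
    for user_id in range(1, light_users + 1):
        items.append({"item_id": next_id, "user_id": user_id})
        next_id += 1
    return items


if __name__ == "__main__":
    data = make_skewed_items()
    print("by item_id:", shard_sizes(data, "item_id", 4))
    print("by user_id:", shard_sizes(data, "user_id", 4))
```

With this data, routing by item_id gives four shards of nearly identical size, while routing by user_id puts the heavy user's items, and therefore most of the table, on a single shard.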

There are also some disadvantages to this approach. For example, it will be more difficult to back up all items of a given user.

When items are meaningful only as a user's complete set, it is more reasonable to shard by user_id.

Upvotes: 2
