Shackrock
Shackrock

Reputation: 4701

How to handle user's data in MySQL/PHP, for large number of users and data entries

Let's pretend with me here:

PHP/MySQL web-application. Assume a single server and a single MySQL DB.

I have 1,000 bosses. Every boss has 10 workers under them. These 10 workers (times 1k, totaling 10,000 workers) each have at least 5 database entries (call them work orders for this purpose) in the WebApplication every work day. That's 50k entries a day in this work orders table.

Server issues aside, I see two main ways to handle the basic logic of the database here:

  1. Each Boss has an ID. There is one table called workorders and it has a column named BossID to associate every work order with a boss. This leaves you with approximately 1 million entries a month in a single table, and to me that seems to add up fast.

  2. Each Boss has it's own table that is created when that Boss signed up, i.e. work_bossID where bossID = the boss' unique ID. This leaves you with 1,000 tables, but these tables are much more manageable.


I know that at some point we have to bring in talks of multiple servers, and databases linked together... but again, let's focus on a single server here with a singly MySQL DB.

Upvotes: 3

Views: 2861

Answers (4)

tadman
tadman

Reputation: 211560

Scaling is often a case of experimentation as it's not always clear from the outset where the bottlenecks will be. Since you seem to have a pretty good idea of the kind of load the system will be under, one of the first things to do is capture this in a spreadsheet so you can work out some hypotheticals. This allows you do do a lot of quick "what if" scenarios and come up with a reasonable upper end for how far you have to scale with your first build.

For collecting large numbers of records there's some straight-forward rules:

  • Use the most efficient data type to represent what you're describing. Don't worry about using smaller integer types to shave off a few bytes, or shrinking varchars. What's important here is using integers for numbers, date fields for dates, and so on. Don't use a varchar for data that already has a proper type.
  • Don't over-index your table, add only what is strictly necessary. The larger the number of indexes you have, the slower your inserts will get as the table grows.
  • Purge data that's no longer necessary. Where practical delete it. Where it needs to be retained for an extended period of time, make alternate tables you can dump it into. For instance, you may be able to rotate out your main orders table every quarter or fiscal year to keep it running quickly. You can always adjust your queries to run against the other tables if required for reporting. Keep your working data set as small as practical.
  • Tune your MySQL server by benchmarking, tinkering, researching, and experimenting. There's no magic bullet here. There's many variables that may work for some people but might slow down your application. They're also highly dependent on OS, hardware, and the structure and size of your data. You can easily double or quadruple performance by allocating more memory to your database engine, for instance, either InnoDB or MyISAM.
  • Try using other MySQL forks if you think they might help significantly. There are a few that offer improved performance over the regular MySQL, Percona in particular.
  • If you query large tables often and aggressively, it may make sense to de-normalize some of your data to reduce the number of expensive joins that have to be done. For instance, on a message board you might include the user's name in every message even though that seems like a waste of data, but it makes displaying large lists of messages very, very fast.

With all that in mind, the best thing to do is design your schema, build your tables, and then exercise them. Simulate loading in 6-12 months of data and see how well it performs once really loaded down. You'll find all kinds of issues if you use EXPLAIN on your slower queries. It's even better to do this on a development system that's slower than your production database server so you won't have any surprises when you deploy.

The golden rule of scaling is only optimize what's actually a problem and avoid tuning things just because it seems like a good idea. It's very easy to over-engineer a solution that will later do the opposite of what you intend or prove to be extremely difficult to un-do.

MySQL can handle millions if not billions of rows without too much trouble if you're careful to experiment and prove it works in some capacity before rolling it out.

Upvotes: 1

Neta Meta
Neta Meta

Reputation: 4047

i had database size problem as well in one of my networks so big that it use to slow the server down when i run query on that table..

in my opinion divide your database into dates decide what table size would be too big for you - let say 1 million entries then calculate how long it will take you to get to that amount. and then have a script every that period of time to either create a new table with the date and move all current data over or just back that table up and empty it.

like putting out dated material in archives.

if you chose the first option you'll be able to access that date easily by referring to that table.

Hope that idea helps

Upvotes: -1

ronalchn
ronalchn

Reputation: 12335

If you use a single server, I don't think there is a problem with how big the table gets. It isn't just the number of records in a table, but how frequently it is accessed.

To manage large datasets, you can use multiple servers. In this case:

  • You can keep all workorders in a single table, and mirror them across different servers (so that you have slave servers)

  • You can shard the workorders table by boss (in this case you access the server depending on where the workorder belongs) - search for database sharding for more information

Which option you choose depends on how you will use your database.


Mirrors (master/slave)

Keeping all workorders in a single table is good for querying when you don't know which boss a workorder belongs to, eg. if you are searching by product type, but any boss can have orders in any product type.

However, you have to store a copy of everything on every mirror. In addition only one server (the master) can deal with update (or adding workorder) SQL requests. This is fine if most of your SQL queries are SELECT queries.


Sharding

The advantage of sharding is that you don't have to store a copy of the record on every mirror server.

However, if you are searching workorders by some attribute for any boss, you would have to query every server to check every shard.


How to choose

In summary, use a single table if you can have all sorts of queries, including browsing workorders by an attribute (other than which boss it belongs to), and you are likely to have more SELECT (read) queries than write queries.

Use shards if you can have write queries on the same order of magnitude as read queries, and/or you want to save memory, and queries searching by other attributes (not boss) are rare.


Keeping queries fast

Large databases are not really a big problem, if they are not overwhelmed by queries, because they can keep most of the database on hard disk, and only keep what was accessed recently in cache (on memory).

The other important thing to prevent any single query from running slowly is to make sure you add the right index for each query you might perform to avoid linear searches. This is to allow the database to binary search for the record(s) required.

If you need to maintain a count of records, whether of the whole table, or by attribute (category or boss), then keep counter caches.

When to get a new server

There isn't really a single number you can assign to determine when a new server is needed because there are too many variables. This decision can be made by looking at how fast queries are performing, and the CPU/memory usage of your server.

Upvotes: 3

Jared Drake
Jared Drake

Reputation: 1002

Just create a workers table, bosses table, a relationships table for the two, and then all of your other tables. With a relationship structure like this, it's very dynamic. Because, if it ever got large enough you could create another relationship table between the work orders to the bosses or to the workers.


You might want to look into bigints, but I doubt you'll need that. I know it that the relationships table will get massive, but thats good db design.

Of course bigint is for mySQL, which can go up to -9223372036854775808 to 9223372036854775807 normal. 0 to 18446744073709551615 UNSIGNED*

Upvotes: -2

Related Questions