user2926426
user2926426

Reputation:

Web Application Database - Optimum performance and scalability

I am currently working on building a web application where each customer is able to create several projects. For each project, the user will upload thousands of rows of data (1,000-50,000), lets call them 'posts'.

Is it better to create a single project database where each row contains its project ID, create a database per customer where each row contains its project ID or a database per project.

'Better' is defined as better performance and ability to scale. All of the posts from each project will be used in machine learning algorithms with thousands of features (5,000-10,000) and displayed to users to mark as positive or negative. We are yet to decide a database.

Upvotes: 0

Views: 141

Answers (2)

Nadeem
Nadeem

Reputation: 324

Looking at your brief question you have three entities: N User , (N * 1000) Project, (N * (N*1000) * 50000 )Posts

Based on my assumption that are talking about machine learning stuff and sclability, so I can safely assume a minimum of 100,000 users. Each user can create several projects let's say 1000. Each projects can have 50,000 Posts.

100,000 Users * 1000 Project * 50,000 Posts.

RDBMS solution would be to Create User table(UserId, Name etc..), Create Project Table (ProjectId, etc..) Create Mapping table for UserProject (UserProjectId, UserId, ProjectId) Create Posts table (PostId, PostContent..., PostDate)

The last table really depends on the actual data and read/write frequency. If you just want read performance with little insert operation it can be pretty achievable in any modern RDBMS, multiple option:

  • One table UserPosts(UserProjectId, PostId) This is going to be huge table will definitely cross billions of rows but it can handle it.
  • You can create multiple UserPosts table for the range of UserProjectId depending on the volume, then create a View/Function on top of it to decide which table to pick data based on UserProjectId.

Upvotes: 0

Hazzit
Hazzit

Reputation: 6882

If I understand your application right, you are currently comparing two solutions:

  1. Create N databases, each containing 1 table, containing 1 project's data
  2. Create 1 database, containing 1 table, containing N projects' data (requiring an extra column for the project id)

There is a third possibility, which is

  1. Create 1 database, containing N tables, each containing 1 project

Performance: The performance difference between (1) and (3) tends to be insignificant if you are only accessing data of one project at any given time, (and all run on the same server in the case of (1)). (3) tends to be faster and easier if you access more than one project at once. There is a ton of information on StackOverflow comparing the other two options (2) and (3), single-table vs. mutiple-table. Most of the time, the performance difference is minor with good indexes in place, but not insignificant.

Scalability: You state that each project has in the order of thousands of rows. You don't state how many projects there might be. You also don't state how often that data needs to be requested or changed. So all of the following ist just a guideline: Modern servers can easily handle millions of rows (at 1k/row that's just Gigabytes) in a single database instance and easily serve thousands of requests per minute (if properly designed). They start to struggle when it comes to billions of rows or more (multiple Terabytes of data), or multiple thousands of requests per second. It depends a lot on the design of the database itself when exactly you'll hit the metaphorical wall, but once you hit a size where one database instance cannot handle it anymore for either size or speed reasons, you'll have to scale either horizontally or vertically. Horizontal scaling (i.e. adding more servers) is easier and less expensive with multiple databases. Separating your projects into different database may make this easier, but such an easy solution will very likely end in lots of servers just idleing around while others are running hot. Most modern database systems allow a single database to be split across multiple machines even within one database.

In the end, you didn't specify enough to really answer your question. And from the way you wrote it, I'm guessing you currently are not really in the position to make a final decision yet - you're only trying not to shoot yourself into your own foot. So here is a quick guide to

How to Not Shoot Yourself in the Foot

  • Separate all database access code from into a single unit of code.
  • Keep all program logic outside of that code
  • Keep that piece of code well maintained and documented
  • Let somebody with knowledge of multiple DBMS look at your queries to make sure they're easily portable

If you do that, you'll be able to later change database systems or change partitioning of data by only modifying that one file. You can even benchmark different designs by just exchanging that single code file.

Upvotes: 2

Related Questions