Reputation:
I am currently working on building a web application where each customer is able to create several projects. For each project, the user will upload thousands of rows of data (1,000-50,000), lets call them 'posts'.
Is it better to create a single project database where each row contains its project ID, create a database per customer where each row contains its project ID or a database per project.
'Better' is defined as better performance and ability to scale. All of the posts from each project will be used in machine learning algorithms with thousands of features (5,000-10,000) and displayed to users to mark as positive or negative. We are yet to decide a database.
Upvotes: 0
Views: 141
Reputation: 324
Looking at your brief question you have three entities:
N User , (N * 1000) Project, (N * (N*1000) * 50000 )Posts
Based on my assumption that are talking about machine learning stuff and sclability, so I can safely assume a minimum of 100,000 users. Each user can create several projects let's say 1000. Each projects can have 50,000 Posts.
100,000 Users * 1000 Project * 50,000 Posts.
RDBMS solution would be to
Create User table(UserId, Name etc..),
Create Project Table (ProjectId, etc..)
Create Mapping table for UserProject (UserProjectId, UserId, ProjectId)
Create Posts table (PostId, PostContent..., PostDate)
The last table really depends on the actual data and read/write frequency. If you just want read performance with little insert operation it can be pretty achievable in any modern RDBMS, multiple option:
Upvotes: 0
Reputation: 6882
If I understand your application right, you are currently comparing two solutions:
There is a third possibility, which is
Performance: The performance difference between (1) and (3) tends to be insignificant if you are only accessing data of one project at any given time, (and all run on the same server in the case of (1)). (3) tends to be faster and easier if you access more than one project at once. There is a ton of information on StackOverflow comparing the other two options (2) and (3), single-table vs. mutiple-table. Most of the time, the performance difference is minor with good indexes in place, but not insignificant.
Scalability: You state that each project has in the order of thousands of rows. You don't state how many projects there might be. You also don't state how often that data needs to be requested or changed. So all of the following ist just a guideline: Modern servers can easily handle millions of rows (at 1k/row that's just Gigabytes) in a single database instance and easily serve thousands of requests per minute (if properly designed). They start to struggle when it comes to billions of rows or more (multiple Terabytes of data), or multiple thousands of requests per second. It depends a lot on the design of the database itself when exactly you'll hit the metaphorical wall, but once you hit a size where one database instance cannot handle it anymore for either size or speed reasons, you'll have to scale either horizontally or vertically. Horizontal scaling (i.e. adding more servers) is easier and less expensive with multiple databases. Separating your projects into different database may make this easier, but such an easy solution will very likely end in lots of servers just idleing around while others are running hot. Most modern database systems allow a single database to be split across multiple machines even within one database.
In the end, you didn't specify enough to really answer your question. And from the way you wrote it, I'm guessing you currently are not really in the position to make a final decision yet - you're only trying not to shoot yourself into your own foot. So here is a quick guide to
How to Not Shoot Yourself in the Foot
If you do that, you'll be able to later change database systems or change partitioning of data by only modifying that one file. You can even benchmark different designs by just exchanging that single code file.
Upvotes: 2