Reputation: 1174
We're working on a large internal project that lets users upload Excel files and eventually search all the data gathered from those files. I'm trying to do my homework before we begin the design and come up with the best solution.
A user can upload an Excel file with as many columns as they want, so there is no predefined structure for the file.
In contrast to the first point, there are a few fields we expect the user to have, for example first name and last name. These columns don't have to exist.
Another word about the "other" search field: this field will search all the columns, in all the Excel files, that don't fit the predefined columns. For example, one file has an age column and another has a birthplace column; the "other" field will search through all of those columns.
Dynamically create a new Django model for each uploaded Excel file, with all the columns that file has?
Dynamically create a new Django model for each file, with all the predefined columns (if they exist!) plus an "other" text field that concatenates all the unrelated fields?
Have one big Django model (meaning only one table in my database) with all my predefined fields (which, again, can be null), plus a field called "others" that concatenates all the unrelated columns?
Have my main table with all the predefined columns, and another table with a foreign key to the main table, where each row represents an "other" field (a rough Django sketch follows the example below)?
Example for the 4th solution -
+----+--------+--------+--------+
| id | field1 | field2 | field3 |
+----+--------+--------+--------+
| 1 | val1 | val1 | val1 |
| 2 | val2 | val2 | val2 |
| 3 | val3 | val3 | val3 |
+----+--------+--------+--------+
And the dimension table -
+----+------+------+
| fk | key | val |
+----+------+------+
| 1 | key1 | val1 |
| 1 | key2 | val2 |
| 1 | key3 | val3 |
| 2 | key4 | val4 |
+----+------+------+
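If it helps, here is a rough Django sketch of what I imagine for this 4th solution (all model and field names are just placeholders):

from django.db import models


class MainRecord(models.Model):
    # Predefined columns; nullable because a given file may not contain them.
    first_name = models.CharField(max_length=255, null=True, blank=True)
    last_name = models.CharField(max_length=255, null=True, blank=True)


class OtherField(models.Model):
    # One row per column value that doesn't match a predefined field.
    record = models.ForeignKey(MainRecord, on_delete=models.CASCADE, related_name="others")
    key = models.CharField(max_length=255)   # e.g. "age", "birthplace"
    value = models.TextField()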
As for scaling: we expect to eventually have no more than 1,500 Excel files, each containing between 100 and approximately 100,000 rows (we will probably cap each file at 100k rows). Statistics from the Excel files we have examined say that we won't go over roughly 30 million rows in total.
We will be using Django with either MySQL or PostgreSQL.
I hope my question was clear and not too opaque.
Thanks!
Upvotes: 4
Views: 2281
Reputation: 11322
EDIT: After you changed your question, I have added a short section on your model 4.
I would strongly recommend against dynamically creating tables. It is messy, and I doubt it will perform well: the database creates a separate access path for each table you query, so if you create a table per Excel file you will need to search all of them.
You probably need a variant of your model 3.
This means you use one table, but instead of using a column for each field, you create two columns: one for the Excel column name and one for its value. You will also need an extra column to identify which Excel column and value belong to which spreadsheet.
So conceptually, instead of modelling:
field1 field2 field3 field4 other
------------------------------------
x y z a etc=xyz
You model it like this:
sheet fieldname value
------------------------------------
key field1 x
key field2 y
key field3 z
key field4 a
key etc xyz
The advantage of this model is that programming your searches becomes easier. You can express any search simply as select * from data where fieldname='%s' and value='%s'. If you create a database index on fieldname (and probably an index on the key you use to identify the Excel sheets), there should be no performance penalty compared to your original ideas for model 3.
Your model 4 would also work. It has the advantage that, for the predefined fields, your users' queries map easily onto SQL select statements. It has the disadvantage that you need to handle your "others" columns differently from the rest of the users' search criteria. You also indicated that users sometimes do not supply the columns you expect to be there, which means you have to make those columns nullable, and that increases the storage requirements.
Overall, I think my suggested approach is better than your option 4, as it is conceptually simpler. You indicated that you thought it would create too many rows. It would indeed create more rows, but MySQL and PostgreSQL can easily handle that number. PostgreSQL can store an effectively unlimited number of rows, and MySQL can store around 4,000 million rows (and you can compile MySQL with --big-tables if you need more).
In terms of performance, it makes no real difference how big your table is, as long as you have an index on the fields you search on.
Upvotes: 2