RJD22

Reputation: 10340

Best database design for storing a high number columns?

Situation: We are working on a project that reads datafeeds into our company's database. These datafeeds can contain a high number of fields, which we match to specific columns.

At this moment we have about 120 types of fields, and each of them needs its own column. We need to be able to filter and sort on all of these columns.

The problem is that I'm unsure what database design would be best for this. I'm using MySQL for the job, but I'm open to suggestions. At this moment I'm planning to make a table with all 120 columns, since that is the most natural way to do things.

Options: My other options are a meta table that stores keys and values, or a document-based database, which would give me a variable schema and let me scale when needed.
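For reference, a rough sketch of what the key/value meta table option would look like (the table and column names here are only illustrative):

    CREATE TABLE item_fields (
        item_id     INT UNSIGNED NOT NULL,   -- which feed item the value belongs to
        field_name  VARCHAR(64)  NOT NULL,   -- which of the ~120 field types
        field_value VARCHAR(255),            -- every value stored as text
        PRIMARY KEY (item_id, field_name),
        KEY idx_field_value (field_name, field_value)
    );

Filtering on several fields at once would mean one self-join per field, which is part of what makes me unsure about this option.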

Question: What is the best way to store all this data? The row count could go up to 100k rows, and I need storage that can select, sort and filter really fast.

Update: Some more information about usage. XML feeds will be generated live from this table; we are talking about 100 - 500 requests per hour, but this will be growing. The fields will not change regularly, but they could change once every 6 months. We will also be updating the datafeeds daily: checking whether items have been updated, deleting old ones and adding new ones.
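To illustrate the daily refresh, this is roughly the pattern I have in mind (only a sketch; the table and column names are made up):

    -- Upsert each item from today's feed
    -- (feed_key is assumed to have a UNIQUE index so re-imported items update in place)
    INSERT INTO feed_items (feed_key, title, price, last_seen)
    VALUES (?, ?, ?, NOW())
    ON DUPLICATE KEY UPDATE
        title     = VALUES(title),
        price     = VALUES(price),
        last_seen = NOW();

    -- Afterwards, drop items that no longer appear in the feed
    DELETE FROM feed_items WHERE last_seen < CURDATE();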

Upvotes: 1

Views: 727

Answers (2)

vyegorov

Reputation: 22855

I'm working on a similar project right now, downloading dumps from the net and loading them into the database, merging changes into the main table and properly adjusting the dictionary tables.

First, you know the data you'll be working with, so analyze it in advance and pick the best table/column layout. If all of your 120 columns contain textual data, then a single row will take several kilobytes of disk space. In such a situation you will want to make all queries highly selective, so that indexes are used to minimize I/O; full scans might take significant time with such a design. You've said nothing about how big your 500/hour requests will be: will each request extract a single row, a small bunch of rows, or a big portion (up to the whole table)?

Second, looking at the data, you might identify a number of columns that have a limited set of values. I prefer to do the following transformation for such columns:

  • set up a dictionary table, giving it an integer PK;
  • replace the actual value in the master table's column with the PK from the dictionary (see the sketch below).
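A minimal sketch of that layout, assuming a low-cardinality "color" field (all names are only illustrative):

    -- Dictionary of distinct values for one low-cardinality column
    CREATE TABLE color_dict (
        id    INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        value VARCHAR(50)  NOT NULL UNIQUE
    );

    -- The master table keeps a small integer key instead of the repeated text
    CREATE TABLE feed_items (
        id       INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        title    VARCHAR(255),
        color_id INT UNSIGNED,        -- references color_dict.id
        -- ... the remaining feed columns ...
        KEY idx_color (color_id)
    );

    -- A typical lookup then resolves the value through the dictionary:
    -- SELECT f.* FROM feed_items f JOIN color_dict c ON c.id = f.color_id WHERE c.value = 'red';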

In my case the transformation is done by triggers written in C, so although it adds an upload penalty, I get some benefits:

  • decreased total size of the database and master table;
  • better options for the database and OS to cache frequently accessed data blocks;
  • better query performance.

Third, try to split the data according to the extracts you'll be doing. Quite often it turns out that only 30-40% of the fields in the table are used by practically all queries, while the remaining 60-70% are spread across the queries and each is used only by some of them. In this case I would recommend splitting the main table accordingly: extract the fields that are always used into a single "master" table, and create another one for the rest of the fields. In fact, you can have several such secondary tables, logically grouping the data.

In my practice we had a table that contained detailed customer information: name details, address details, status details, banking details, billing details, financial details and a set of custom comments. All queries against that table were expensive, as it was used in the majority of our reports (and reports typically perform full scans). By splitting this table into a set of smaller ones and building a view with rules on top of them (to keep the external application happy), we managed to gain a pleasant performance boost (sorry, I don't have the numbers any longer).
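As a rough sketch of that kind of split (the tables, columns and the plain view here are invented for illustration; in our case the view also had rules on top so it stayed writable):

    -- Fields needed by almost every query stay in the narrow master table
    CREATE TABLE customer_core (
        id     INT UNSIGNED NOT NULL PRIMARY KEY,
        name   VARCHAR(100) NOT NULL,
        status VARCHAR(20)  NOT NULL
    );

    -- Rarely used fields move to a companion table sharing the same PK
    CREATE TABLE customer_details (
        id              INT UNSIGNED NOT NULL PRIMARY KEY,
        banking_details VARCHAR(255),
        billing_details VARCHAR(255),
        comments        TEXT,
        FOREIGN KEY (id) REFERENCES customer_core (id)
    );

    -- A view re-creates the old wide shape for the external application
    CREATE VIEW customer AS
    SELECT c.id, c.name, c.status,
           d.banking_details, d.billing_details, d.comments
    FROM customer_core c
    LEFT JOIN customer_details d ON d.id = c.id;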

To summarize: you know the data you'll be working with and you know the queries that will be used to access your database; analyze and design accordingly.

Upvotes: 0

Cylindric

Reputation: 5894

120 columns at 100k rows is not enough information; that only really gives one of the metrics: size. The other is transactions. How many transactions per second are you talking about here?

Is it a nightly update with a manager running a report once a week, or a million page-requests an hour?

I don't generally need to start looking at 'clever' solutions until hitting a 10m record table, or hundreds of queries per second.

Oh, and do not use a Key-Value pair table. They are not great in a relational database, so stick to properly typed fields.

I personally would recommend sticking to a conventional one-column-per-field approach and only deviate from this if testing shows it really isn't right.

With regard to retrieval, if the INSERTs/UPDATEs are only happening daily, then I think some careful indexing on the server side, plus good caching wherever the XML is generated, should reduce the server hit a good amount. For example, since you say 'we will be updating the datafeeds daily', there shouldn't be any need to query the database on every request. And 1000 requests per hour is only 17 per minute, which probably rounds down to nothing.
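As an illustration of the indexing side (the table and column names are hypothetical; the point is to match the index to whatever gets filtered and sorted most often):

    -- Composite index matching a common "filter on category, sort by price" pattern;
    -- with it, WHERE category = ? ORDER BY price can avoid a filesort.
    CREATE INDEX idx_category_price ON feed_items (category, price);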

Upvotes: 1
