Reputation: 3479
I am building a database for microarray data. Each patient sample has over 1,000,000 features and I would like to store the patient samples as rows in an SQL table with each feature as a column.
HuEX Microarray Data
+----+----------+----------+-----+------------------+
| ID | Feature1 | Feature2 | ... | Feature1,000,000 |
+----+----------+----------+-----+------------------+
| 1  | 2.3543   | 10.5454  | ... | 5.34333          |
| 2  | 13.4312  | 1.3432   | ... | 40.23422         |
+----+----------+----------+-----+------------------+
I know most relational database systems have limits on the number of columns in a table.
+------------+---------------------------------+
| DBMS       | Max Table Col #                 |
+------------+---------------------------------+
| SQL Server | 1,024 (30,000 with sparse cols) |
| MySQL      | 4,096 (65,535-byte row limit)   |
| PostgreSQL | 250 - 1,600 (depends on types)  |
| Oracle     | 1,000                           |
+------------+---------------------------------+
Obviously these limitations are too low for my task. Is there any way to increase the number of columns an SQL database table can have, or is there another DBMS that can handle such a high number of table columns?
Update
Note all the columns will have values for all the rows.
Upvotes: 8
Views: 1365
Reputation:
Well, going with the new information that this is a dense array of homogeneous numeric (double) values and that querying is important (that is, I will disregard de-normalization into blobs/XML and the use of special UDFs), I propose the following:
Split each result into multiple records, where each record is of the form:
ID, SEGMENT, Value0, Value1, ..., Value(q-1)   // one value column per x in [0, q)
The value of q is arbitrary but should be chosen based upon the particular database implementation (e.g. try to fit into the 8k record size in SQL Server) for performance/efficiency reasons.
Each result will then be split across these records such that SEGMENT identifies which segment (block of q consecutive features) the record holds. That is, the "absolute index" of a given feature is n = SEGMENT * q + x, and feature n will be found in the record where SEGMENT = n / q (integer division). It then follows that the Primary Key is (ID, SEGMENT).
Thus querying is still easy -- the only change is the conversion to/from the segment -- with the only additional requirement being the SEGMENT (this column may also participate in an index).
(A separate table can be used to map features to SEGMENT/x or otherwise. In this way it is similar to an EAV Model.)
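To make this concrete, here is a minimal sketch of such a segmented table (table and column names are illustrative, and q is shrunk to 4 just to keep the DDL short; in practice q would be in the hundreds):

CREATE TABLE MicroarraySegment (
    ID      INTEGER NOT NULL,   -- sample/chip id
    SEGMENT INTEGER NOT NULL,   -- which block of q consecutive features this row holds
    Value0  REAL,               -- feature at absolute index SEGMENT * q + 0
    Value1  REAL,               -- ... + 1
    Value2  REAL,               -- ... + 2
    Value3  REAL,               -- ... + 3
    PRIMARY KEY (ID, SEGMENT)
);

-- Fetching the feature at absolute (0-based) index n = 123457 for sample 42:
--   SEGMENT = 123457 / 4 = 30864 (integer division), x = 123457 mod 4 = 1 -> Value1
SELECT Value1
FROM MicroarraySegment
WHERE ID = 42 AND SEGMENT = 30864;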
Thus, while similar in some ways to a fully normalized form, it takes advantage of the packed/homogeneous/static-feature nature of the initial matrix to significantly reduce the number of records -- while 2 million records is an arguably small table and 20 million records is only a "mid-sized" table, 200 million records (the result of 200 chips x 1 million features per chip, if each feature gets its own record) starts to become daunting. With the same complexity, a q of 200 would reduce the number of records to a mere 1 million. (Each compacted record is also much more efficient in terms of data/structure ratio.)
Happy coding.
While the above is one tentative "what if" suggestion on my part, I would encourage exploring the problem more -- in particular, the exact data-access patterns required. I'm not sure this is a "typical" usage of a standard RDBMS and an RDBMS may not even be a good way to approach this problem.
Upvotes: 2
Reputation:
This is actually a use case for an Entity-Attribute-Value Model (EAV), and may be better suited to non-RDBMS/SQL solutions in some intense environments. (Relational databases are workhorses though ... might as well use one until it's provably not sufficient ;-)
From the Wikipedia article:
Entity-attribute-value model (EAV) is a data model to describe entities where the number of attributes (properties, parameters) that can be used to describe them is potentially vast, but the number that will actually apply to a given entity is relatively modest. In mathematics, this model is known as a sparse matrix.
Happy coding.
Upvotes: 2
Reputation: 91209
Try rearranging your table to:
CREATE TABLE MicroarrayData (
    SampleID  INTEGER,
    FeatureID INTEGER,
    Value     REAL,
    PRIMARY KEY (SampleID, FeatureID)
);
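For example, a single feature value (or a whole sample) then becomes an ordinary keyed lookup -- the literal IDs below are just for illustration:

SELECT Value
FROM MicroarrayData
WHERE SampleID = 1 AND FeatureID = 500000;

-- all features of one sample:
SELECT FeatureID, Value
FROM MicroarrayData
WHERE SampleID = 1
ORDER BY FeatureID;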
Upvotes: 4
Reputation: 238256
You'd normally split (normalize) the tables:
Sample: ID, PatientID
Feature: ID, Name
SampleFeature: SampleID, FeatureID, value
SQL databases can't handle a lot of columns, but they can handle a lot of rows.
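A DDL sketch of that split (the column types and the Name length are my assumptions; the answer only names the tables and their columns):

CREATE TABLE Sample (
    ID        INTEGER PRIMARY KEY,
    PatientID INTEGER NOT NULL
);

CREATE TABLE Feature (
    ID   INTEGER PRIMARY KEY,
    Name VARCHAR(100) NOT NULL
);

CREATE TABLE SampleFeature (
    SampleID  INTEGER NOT NULL REFERENCES Sample (ID),
    FeatureID INTEGER NOT NULL REFERENCES Feature (ID),
    Value     REAL    NOT NULL,
    PRIMARY KEY (SampleID, FeatureID)
);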
Upvotes: 4
Reputation: 888027
Don't.
Even if you can make it work, it will be very slow and unwieldy.
Instead, you should make a separate table with columns for PatientID, Feature, and Value.
This table would have one row for each cell in your proposed table.
It also makes it possible to add additional information about each patient-feature pair.
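For instance, a sketch of such a table (names are illustrative, and the QualityFlag column is only an example of the extra per-pair information this layout allows):

CREATE TABLE PatientFeatureValue (
    PatientID   INTEGER NOT NULL,
    Feature     INTEGER NOT NULL,
    Value       REAL    NOT NULL,
    QualityFlag CHAR(1),            -- example: per-measurement QC flag
    PRIMARY KEY (PatientID, Feature)
);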
Upvotes: 13