Nixuz

Reputation: 3479

Improve performance of PostgreSQL array queries

I am storing large vectors (1.4 million values) of doubles in a PostgreSQL table. This table's create statement follows.

CREATE TABLE analysis.expression
(
  celfile_name character varying NOT NULL,
  core double precision[],
  extended double precision[],
  "full" double precision[],
  probeset double precision[],
  CONSTRAINT expression_pkey PRIMARY KEY (celfile_name)
)
WITH (
  OIDS=FALSE
);
ALTER TABLE analysis.expression ALTER COLUMN core SET STORAGE EXTERNAL;
ALTER TABLE analysis.expression ALTER COLUMN extended SET STORAGE EXTERNAL;
ALTER TABLE analysis.expression ALTER COLUMN "full" SET STORAGE EXTERNAL;
ALTER TABLE analysis.expression ALTER COLUMN probeset SET STORAGE EXTERNAL;

Each entry in this table is written only once and possibly read many times at random indices. PostgreSQL doesn't seem to scale well for these lookups as the vector length grows, even with STORAGE set to EXTERNAL: accessing a single element appears to cost O(n) in the array length. This makes queries like the following, where many individual values are selected from the array, very slow (minutes to hours).

SELECT probeset[2], probeset[15], probeset[102], probeset[1007], probeset[10033], probeset[200101], probeset[1004000]
FROM analysis.expression
LIMIT 1000;

If enough individual indices are being pulled, it can even be slower than pulling the whole array.
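
For reference, the comparison can be reproduced with something like the following rough Python sketch (psycopg2 assumed; the connection string is a placeholder and the timings obviously depend on the machine):

import time
import psycopg2

conn = psycopg2.connect("dbname=mydb")  # placeholder connection string
cur = conn.cursor()

indices = [2, 15, 102, 1007, 10033, 200101, 1004000]

def timed(label, sql):
    # Run the query, fetch all rows, and report wall-clock time.
    start = time.time()
    cur.execute(sql)
    cur.fetchall()
    print(f"{label}: {time.time() - start:.1f}s")

element_list = ", ".join(f"probeset[{i}]" for i in indices)
timed("individual elements",
      f"SELECT {element_list} FROM analysis.expression LIMIT 1000")
timed("whole array",
      "SELECT probeset FROM analysis.expression LIMIT 1000")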

Is there any way to make such queries faster?


Upvotes: 5

Views: 4074

Answers (1)

Patrick

Reputation: 32161

You store your data in a structured data management system (i.e. PostgreSQL), but because of the nature of that data (large but irregularly sized collections of like values) it effectively ends up outside the regular row storage. PostgreSQL is not good at retrieving individual elements from irregular, unpredictably large arrays, as you have noticed; the fact that the arrays have to be stored externally is already testament to the fact that your requirements are not aligned with where PostgreSQL excels. It is very likely that there are much better solutions than PostgreSQL for storing and reading your arrays.

The fact that the results of analyzing the arrays through prediction models end up in tables in a PostgreSQL database hints at a hybrid solution: store your data in some form that allows efficient access in the patterns that you need, then store the results in PostgreSQL for further processing.

Since you do not provide any details on the prediction models, it is impossible to be specific in this answer, but I hope this will help you on your way.

If your prediction models are written in a language for which a PostgreSQL driver is available, then store your data in a format suited to that language, run your predictions, and write the results to a table in PostgreSQL. This works for C and C++ using the libpq library, for Java using JDBC, and for C#, Python, etc. using their respective high-level drivers.
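
As a rough illustration of that pattern in Python (numpy and psycopg2 assumed; the file name, results table, and prediction function below are hypothetical placeholders, not anything from your setup), the vectors live in a memory-mapped binary file where random access only touches the pages holding the requested values, and only the model output goes back into PostgreSQL:

import numpy as np
import psycopg2

N_VALUES = 1_400_000  # length of each vector

# Memory-mapped flat file of doubles, one row of N_VALUES per celfile.
# Reading a handful of indices only touches the pages that contain them.
probeset = np.memmap("probeset_vectors.dat", dtype=np.float64, mode="r")
probeset = probeset.reshape(-1, N_VALUES)

def predict(values):
    # Placeholder for the real prediction model.
    return float(values.mean())

# Note: numpy indexing is 0-based, unlike PostgreSQL arrays.
indices = [1, 14, 101, 1006, 10032, 200100, 1003999]

conn = psycopg2.connect("dbname=mydb")  # placeholder connection string
with conn, conn.cursor() as cur:
    for row_id, row in enumerate(probeset):
        score = predict(row[indices])  # cheap random access into the memmap
        cur.execute(
            "INSERT INTO analysis.prediction_results (celfile_row, score) "
            "VALUES (%s, %s)",  # hypothetical results table
            (row_id, score),
        )

The array data itself never passes through PostgreSQL here; only the small result set does, which is the part PostgreSQL is good at.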

If your prediction model is written in MATLAB, then store your arrays in a MATLAB format and connect to PostgreSQL for the results. If it is written in R, you can use the R extension for PostgreSQL.

The key here is that you should store the arrays in a form that allows for efficient use in your prediction models. Match your data storage to the prediction models, not the other way around.

Upvotes: 4
