Reputation: 2574

Most Efficient (Storage Space) Way To Store Histograms in SQL

I need to store a billion histograms in SQL. These histograms have identical buckets but can have very large variation in their counts, however, most buckets are 0 a lot of the time.

My initial attempt was to have a row per histogram where each column would represent a bucket.

I have been very careful with my datatypes but still the table looks like exceeding the storage allocated for it.

I was wondering if anyone has come across an efficient solution to storing ranges of values (where 0 is the most common occurrence) in MS SQL before I have to make a request for more hardware.

Thanks in advance.

Tot.

Upvotes: 2

Answers (4)

simba.wei

Reputation: 11

Actually, we can solve this question by creating only one table. If create more than one tables, we must use join operator. It is ineffective to get the histogram we want when we need to use it.

CREATE TABLE HISTOGRAM_VALUE
{
  HISTOGRAM_ID INT,
  BUCKET_ID INT,
  BUCKET_MIN_VALUE INT,  //or whatever value type you want
  BUCKET_HEIGHT INT,
  // other metadata
  PRIMARY KEY(HISTOGRAM_ID,BUCKET_ID,BUCKET_MIN_VALUE)
};

the BUCKET_MIN_VALUE is the min_value (or we can understand the left boundary of a bucket range) of each bucket.

Upvotes: 1

Branko Dimitrijevic

Reputation: 52137

Are histograms atomic from the data management point of view? What I mean by that is: do you always read or write the whole histogram as an indivisible unit in the database?

If yes, just serialize it to a BLOB. You can even swipe it through some compression library before writing to BLOB, for good measure.

If no, consider using something like this:

CREATE TABLE HISTOGRAM (
    HISTOGRAM_ID int PRIMARY KEY
    -- Other fields...
);

CREATE TABLE HISTOGRAM_VALUE (
    HISTOGRAM_ID int REFERENCES HISTOGRAM (HISTOGRAM_ID),
    BUCKET_NO smallint,
    VALUE decimal NOT NULL, -- Or whatever type is appropriate.
    PRIMARY KEY (HISTOGRAM_ID, BUCKET_NO)
);

(NOTE: If you are absolutely positively sure you'll never need more than 256 buckets, you could even use tinyint for BUCKET_NO, to squeeze-out some more space efficiency.)

Keep in mind that InnoDB tables are always clustered, so the HISTOGRAM_VALUE table above is just a single B-tree, with no table heap nor other B-trees (since there are no secondary indexes - the FOREIGN KEY can be satisfied directly off the primary index). This is about as efficient storage-wise as you can get with an InnoDB table.

To save space, simply omit buckets with 0 value, except when the histogram starts or ends with such a bucket. For example...

0   0   14.7    -12.9   0   0   55.1    0   0   0

...could be represented as:

HISTOGRAM_ID    BUCKET_NO    VALUE
1               1            0
1               3            14.7
1               4            -12.9
1               7            55.1
1               10           0

Upvotes: 2

naomi

Reputation: 1984

I would never dream of suggesting this under any other circumstances, but since space is the overriding issue here, you might want to experiment with it ...

It might be efficient to store each histogram in a single varchar field, with the amount in each bucket separated by some delimiter, e.g.

"1,,23,,,789789789" means 1 in the first bucket, 0 in the second and so on.

Upvotes: 1

Cade Roux

Reputation: 89721

CREATE TABLE Histogram (
    HistogramID BIGINT /* INT only goes to 2bn */ IDENTITY NOT NULL CONSTRAINT PK_Histogram PRIMARY KEY
    -- Other metadata like the date and time or whatever
)

CREATE TABLE Bucket (
    BucketID INT /* or smaller */ IDENTITY NOT NULL CONSTRAINT PK_Bucket PRIMARY KEY
    -- Other metadata like the range it applies to
)

CREATE TABLE HistogramValue (
    HistogramID BIGINT NOT NULL
    ,BucketID INT NOT NULL
    ,Counter BIGINT /* or smaller datatype */ NOT NULL
    ,CONSTRAINT PK_HistogramValue PRIMARY KEY (HistogramID, BucketID)
    ,CONSTRAINT FK_Histogram FOREIGN KEY REFERENCES Histogram(HistogramID)
    ,CONSTRAINT FK_Bucket FOREIGN KEY REFERENCES Bucket(BucketID)
)

The HistogramValue table will be sparse. You can left join from the Bucket table to the HistogramValue table for a particular histogram to get the "entire" histogram:

SELECT b.Range
       ,COALESCE(hv.Counter, 0) AS Counter
FROM Bucket b
LEFT JOIN HistogramValue hv
    ON hv.HistogramID = @HistogramID
    AND hv.BucketID = b.BucketID

This is a typical normalized model which would be relatively easy to maintain, load and export.

Upvotes: 7

Most Efficient (Storage Space) Way To Store Histograms in SQL

Answers (4)

Related Questions