Need a highly compressed datastore for Crawl data and log data

Question

I have to store a lot of crawl and log data in a datastore with a high compression ratio.

So far I have installed and tried Cassandra, Couchbase, MySQL and a flat-file format, and I have read the architectural overviews of Google Bigtable, Hypertable and the LevelDB file layout.

Cassandra and Couchbase use about 1/5 of the disk space of the uncompressed MySQL database, but I want better results.

So I need a simple data store with strong compression features, like those in the Vertica, Teradata, Oracle and SQL Server products (page-level compression).

The current flat-file data set looks like this:

/oil_type/gas_station/2014-03/2014-03-05-23.csv
/oil_type/gas_station/2014-03/2014-03-06-00.csv
/oil_type/gas_station/2014-03/2014-03-06-01.csv

Each file holds about 400 highly redundant entries of about 5 KB each. A file can be compressed from 1722 KB to 39 KB, so a compression ratio of 44:1 up to 100:1, depending on the compression chunk size, should be possible.
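To illustrate how the chunk size influences the ratio, here is a minimal sketch using Python's standard `zlib` module. The synthetic entries, `make_entries` and `ratio_for_chunk_size` are hypothetical stand-ins for the real crawl files, so the absolute numbers will not match the 44:1 measured above; only the trend (larger chunks, better ratio) is the point.

```python
import zlib


def make_entries(n=400, size=5000):
    """Build n synthetic, highly redundant entries (stand-in for crawled pages)."""
    template = ("<html><body>" + "station data filler " * 300)[: size - 40]
    # Only the price changes between entries, mimicking "a few bytes change".
    return [
        (template + f"<price>{1.400 + i * 0.001:.3f}</price>").encode("utf-8")
        for i in range(n)
    ]


def ratio_for_chunk_size(entries, chunk_size):
    """Compress the concatenated entries in independent chunks; return raw/compressed."""
    raw = b"".join(entries)
    compressed = 0
    for start in range(0, len(raw), chunk_size):
        # Each chunk is compressed on its own, so it can be read without its neighbours.
        compressed += len(zlib.compress(raw[start : start + chunk_size], 6))
    return len(raw) / compressed


if __name__ == "__main__":
    entries = make_entries()
    for kb in (4, 64, 1024):
        print(f"{kb:>5} KB chunks: {ratio_for_chunk_size(entries, kb * 1024):.1f}:1")
```

Larger chunks give the compressor more shared context, but a single-record read then has to decompress more data, which is the trade-off behind wanting a configurable chunk size.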

Defining the use case:

I have to poll all relevant gas_station web pages/APIs every 30 seconds to get up-to-the-minute pricing information. Because it is not possible to write a parser for every gas station, a generic solution is required for index creation. With a database holding all crawled gas station pages, a generic parser can easily be developed and backtested. Keeping this raw data model should avoid data loss caused by broken station-specific converters.

With keys like "oil_type-gas_station-timestamp-content", it is easy and efficient to compare two gas_station pricings over time. For reading a time series that is smaller than the compression chunk size, only 2 to 4 chunks should need to be decompressed.
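A minimal sketch of that key scheme, assuming ISO-8601 timestamps so that lexicographic key order equals chronological order. `make_key` and `range_scan` are hypothetical helper names, and a sorted in-memory list stands in for the sorted key space of a real store:

```python
import bisect


def make_key(oil_type, gas_station, timestamp):
    """Compose a sortable key; assumes names contain no '-' and timestamps are ISO-8601."""
    return f"{oil_type}-{gas_station}-{timestamp}"


def range_scan(sorted_keys, oil_type, gas_station, start_ts, end_ts, reverse=False):
    """Return all keys of one station between start_ts and end_ts (inclusive)."""
    lo = bisect.bisect_left(sorted_keys, make_key(oil_type, gas_station, start_ts))
    hi = bisect.bisect_right(sorted_keys, make_key(oil_type, gas_station, end_ts))
    hits = sorted_keys[lo:hi]
    # Backward iteration is just the same contiguous slice read in reverse.
    return hits[::-1] if reverse else hits


if __name__ == "__main__":
    keys = sorted(
        make_key("diesel", station, f"2014-03-05T23:{minute:02d}:00")
        for station in ("station_a", "station_b")
        for minute in range(60)
    )
    for key in range_scan(keys, "diesel", "station_a",
                          "2014-03-05T23:10:00", "2014-03-05T23:12:00"):
        print(key)
```

Because all entries of one station are adjacent in key order, a short time series maps to a contiguous key range and therefore to only a few neighbouring compressed chunks.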

So the following features would be ideal:

SSTables
Configurable Compression Options (level, compression engine, chunk size from 64 KB to 10 MB)
Range Scans
Java Bindings
Column datastore for better compression

Nice to have:

Replication
Multi Master
write quorum of 1
Forward and backward iteration over the data. (to compare two time series)
configurable replica distribution
few dependencies

Question:

Which free database can hold archived, highly redundant crawl data (only a few bytes change between entries), compresses well, and does not take too long to query a random record? (In contrast to the MySQL ARCHIVE format, which has to decompress the whole table up to the requested row.)
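To make the requirement concrete, this is roughly the access pattern I have in mind: records are sorted by key, grouped into independently compressed blocks, and a small index of first keys decides which single block to decompress. It is only a simplified in-memory sketch of the SSTable-style idea; `BlockStore` and `records_per_block` are hypothetical names, not the API of any existing database.

```python
import bisect
import json
import zlib


class BlockStore:
    """Toy read-only store: sorted records in independently compressed blocks."""

    def __init__(self, records, records_per_block=64):
        items = sorted(records.items())
        self.first_keys = []  # sparse index: first key of every block
        self.blocks = []      # compressed blocks, same order as first_keys
        for i in range(0, len(items), records_per_block):
            block = items[i : i + records_per_block]
            self.first_keys.append(block[0][0])
            self.blocks.append(zlib.compress(json.dumps(block).encode("utf-8")))

    def get(self, key):
        """Look up one record by decompressing only the block that may contain it."""
        idx = bisect.bisect_right(self.first_keys, key) - 1
        if idx < 0:
            return None  # key sorts before the first stored key
        block = json.loads(zlib.decompress(self.blocks[idx]).decode("utf-8"))
        for stored_key, value in block:
            if stored_key == key:
                return value
        return None


if __name__ == "__main__":
    store = BlockStore({f"diesel-station_a-{i:06d}": f"page {i}" for i in range(1000)})
    print(store.get("diesel-station_a-000500"))
```

The cost of a random read is then bounded by the block size rather than by the table size, which is the difference to the archive-style behaviour described above.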

Maybe there is a log database that can index a lot of log lines and compress them internally? (In the scope of Logstash, Fluentd, Flume.)

If someone knows of benchmarks or numbers on this topic, it would help a lot in evaluating the right technology.

I would be grateful for your help!
