Reputation: 8348
Input: Tabular data files: each file has a variable number of columns, and those column names may be unique to the file and/or shared with other files (unknown beforehand). This data is not expected to change, so inserts will be frequent but updates will be rare. Columns and their associated values may be textual or numeric.
Requested Capability: Be able to query by identity or range to retrieve the row(s) of data given a column name and value.
Data Model: In CQL
I can model using a single table representing a specific cell value of a file (in this case textual, but a similar table can be constructed for numeric data)
create table mytable(
colname text,
value text,
filename text,
rowid int,
data map<text,text>,
primary key (colname, value, filename, rowid) -- partitioning on colname may not be ideal here, but will be dealt with in ways unrelated to this question
);
For example, the contents of one file could be:
A B C D E
i1 i2 i3 i4 i5
The inserts would be (one per cell, all for the same row of file F1):
insert into mytable(colname, value, data, filename, rowid) values ('A', 'i1', {'A':'i1', 'B':'i2', 'C':'i3', 'D':'i4', 'E':'i5'}, 'F1', 1);
insert into mytable(colname, value, data, filename, rowid) values ('B', 'i2', {'A':'i1', 'B':'i2', 'C':'i3', 'D':'i4', 'E':'i5'}, 'F1', 1);
insert into mytable(colname, value, data, filename, rowid) values ('C', 'i3', {'A':'i1', 'B':'i2', 'C':'i3', 'D':'i4', 'E':'i5'}, 'F1', 1);
insert into mytable(colname, value, data, filename, rowid) values ('D', 'i4', {'A':'i1', 'B':'i2', 'C':'i3', 'D':'i4', 'E':'i5'}, 'F1', 1);
insert into mytable(colname, value, data, filename, rowid) values ('E', 'i5', {'A':'i1', 'B':'i2', 'C':'i3', 'D':'i4', 'E':'i5'}, 'F1', 1);
...
SELECT data FROM mytable WHERE colname=? AND value=?;
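For the range case mentioned under Requested Capability, a query of roughly the following shape should be possible with the key above, because colname is the partition key and value is the first clustering column. The bounds 'i1' and 'i3' are illustrative values only. Note that a text column is compared lexically, not numerically, which is one reason a separate table for numeric data would be needed.

```cql
-- Range lookup on the first clustering column, restricted to one partition.
-- 'A', 'i1' and 'i3' are example values, not taken from real data.
SELECT data FROM mytable
WHERE colname = 'A' AND value >= 'i1' AND value < 'i3';
```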
Problem: We have data storage constraints. In this model the data map is identical for every cell of a row, which results in a lot of data duplication (this is a simplistic example, but the number of columns could run into the hundreds, meaning the same map can be stored hundreds of times).
Question: Is there any way in Cassandra to avoid this type of data duplication without having to create/query a second table (for example via a pointer to the data or somehow define the map as distinct)? Or alternatively, a different way to model the data with the same query capability and result?
Upvotes: 2
Views: 185
Reputation: 5180
I wouldn't be surprised if your design benefits from C*'s built-in compression; your storage requirements may turn out to be well below your expectations.
By the way, if you want a completely different model, you can try something along these lines:
CREATE TABLE colname_A (
value text,
data map<text,text>,
...,
PRIMARY KEY (value)
);
CREATE TABLE colname_B (
value text,
data map<text,text>,
...,
PRIMARY KEY (value)
);
....
CREATE TABLE colname_xxx (...);
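As a rough sketch of what one of these tables might look like once filled in: the filename and rowid columns below are borrowed from the question's schema and are an assumption, not part of the definitions above. Keeping them as clustering columns means that two rows sharing the same value are stored side by side rather than overwriting each other, since Cassandra writes are upserts on the full primary key.

```cql
-- Hypothetical filled-in version of colname_A (filename and rowid are assumed).
CREATE TABLE colname_A (
    value text,
    filename text,
    rowid int,
    data map<text,text>,
    PRIMARY KEY (value, filename, rowid)
);

-- The single row of file F1 from the question, indexed by its column A value.
INSERT INTO colname_A (value, filename, rowid, data)
VALUES ('i1', 'F1', 1, {'A':'i1', 'B':'i2', 'C':'i3', 'D':'i4', 'E':'i5'});
```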
You can then select your data by issuing:
SELECT * FROM colname_A WHERE value = 'i5';
SELECT * FROM colname_A WHERE value IN ('i4', 'i5');
Be careful with queries that use the IN clause on the partition key, because they can cause problems as you keep loading data into your cluster.
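One commonly suggested way to sidestep a multi-partition IN is to issue one single-partition query per value and merge the results on the client. This is only a sketch of that idea, using the same example values as above; whether it is actually faster depends on the driver and cluster.

```cql
-- One single-partition query per value, instead of one multi-partition IN.
SELECT * FROM colname_A WHERE value = 'i4';
SELECT * FROM colname_A WHERE value = 'i5';
```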
This model organizes data in a different manner and benefits from inter-column data duplication instead of inter-row data duplication. If you have unbalanced columns (e.g. a lot of records in some columns), the compression offered by this data organization could be a big win.
Upvotes: 1