Map redundancy in Cassandra

Question

Input: Tabular data files. Each file has a variable number of columns, and the column names may be unique to a file or shared with other files (this is not known beforehand). The data is not expected to change, so inserts will be frequent but updates will be rare. Columns and their associated values may be textual or numeric.

Requested Capability: Given a column name and a value, query by identity or by range to retrieve the matching row(s) of data.

Data Model: In CQL I can model this with a single table in which each row represents one cell value of a file (textual in this case; a similar table can be constructed for numeric data):

create table mytable(
    colname text,
    value text,
    filename text,
    rowid int,
    data map<text, text>,
    primary key (colname, value, filename, rowid) -- partitioning on colname may not be ideal here, but will be dealt with in ways unrelated to this question
);

For example, the contents of one file could be:

A   B   C   D   E
i1  i2  i3  i4  i5

The insert would be:

insert into mytable(colname, value, data, filename, rowid) values ('A', 'i1', {'A':'i1', 'B':'i2', 'C':'i3', 'D':'i4', 'E':'i5'}, 'F1', 1);
insert into mytable(colname, value, data, filename, rowid) values ('B', 'i2', {'A':'i1', 'B':'i2', 'C':'i3', 'D':'i4', 'E':'i5'}, 'F1', 2);
insert into mytable(colname, value, data, filename, rowid) values ('C', 'i3', {'A':'i1', 'B':'i2', 'C':'i3', 'D':'i4', 'E':'i5'}, 'F1', 3);
insert into mytable(colname, value, data, filename, rowid) values ('D', 'i4', {'A':'i1', 'B':'i2', 'C':'i3', 'D':'i4', 'E':'i5'}, 'F1', 4);
insert into mytable(colname, value, data, filename, rowid) values ('E', 'i5', {'A':'i1', 'B':'i2', 'C':'i3', 'D':'i4', 'E':'i5'}, 'F1', 5);
...

The query would be:

SELECT data FROM mytable WHERE colname = ? AND value = ?;
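
To make the intended identity and range lookups concrete, here is a toy in-memory sketch in Python. It is only an illustration of the desired behaviour, not how Cassandra works internally. It assumes the rows of one partition (colname = 'A') are kept sorted by value, and the sample rows, the file name 'F2', and the helper names `query_range` and `query_identity` are all hypothetical.

```python
from bisect import bisect_left, bisect_right

# Hypothetical rows for the partition colname = 'A':
# (value, filename, rowid, data). Sorted by (value, filename, rowid),
# loosely mirroring the clustering order of the table above.
partition_a = sorted(
    [
        ("i7", "F1", 2, {"A": "i7", "B": "i8"}),
        ("i1", "F1", 1, {"A": "i1", "B": "i2"}),
        ("i3", "F2", 1, {"A": "i3", "C": "x"}),
    ],
    key=lambda r: r[:3],
)


def query_range(partition, low, high):
    """Return the data maps of all rows with low <= value <= high."""
    values = [r[0] for r in partition]
    start = bisect_left(values, low)
    end = bisect_right(values, high)
    return [r[3] for r in partition[start:end]]


def query_identity(partition, value):
    """Identity lookup is a range lookup whose bounds coincide."""
    return query_range(partition, value, value)


print(query_identity(partition_a, "i3"))
print(query_range(partition_a, "i1", "i3"))
```

In both cases the caller gets back the full row map directly, which is the result the single-table model is meant to provide.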

Problem: We have data storage constraints. In this model the data map is identical for every cell of a row, so the same map is stored once per column, which results in a lot of duplication. This is a simplistic example; the number of columns could be in the hundreds, meaning the map can be duplicated hundreds of times.
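
A rough way to see the scale of the duplication is to count stored map entries. The Python sketch below assumes one insert per cell, each carrying the full row map, as in the inserts above; the helper names `fan_out` and `stored_map_entries` and the generated 100-column row are hypothetical.

```python
def fan_out(row):
    """Return one (colname, value, data) record per cell, each with a full copy of the row."""
    return [(col, val, dict(row)) for col, val in row.items()]


def stored_map_entries(row):
    """Total number of map entries written for a single source row."""
    return sum(len(data) for _, _, data in fan_out(row))


# The example row from the file above: 5 columns.
row = {"A": "i1", "B": "i2", "C": "i3", "D": "i4", "E": "i5"}
print(len(fan_out(row)))          # 5 records for one source row
print(stored_map_entries(row))    # 25 map entries (5 x 5)

# A hypothetical wider row with 100 columns.
wide = {f"c{i}": f"v{i}" for i in range(100)}
print(stored_map_entries(wide))   # 10000 map entries (100 x 100)
```

So for a row with n columns, the number of stored map entries grows as n squared rather than n.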

Question: Is there any way in Cassandra to avoid this type of data duplication without having to create and query a second table (for example, via a pointer to the data, or by somehow defining the map as distinct)? Alternatively, is there a different way to model the data that offers the same query capability and result?
