Reputation: 2881

Cassandra select CQL: Cannot add column after wildcard

I need to output the write timestamp as part of a table export for lots of tables, though I quite cannot figure out a way which does not force me to explicitely select all columns in the statement.

Instead of being able to do just this:

SELECT *, writetime(data) AS timestamp  FROM dls.licenses;

I have to do that:

SELECT column1, column2, ... , writetime(data) AS timestamp  FROM dls.licenses;

This is pretty unconvenient since it means I'd have to change the export tool every time the schema of any of the tables changes.

Is there a better way?

Edit: To clarify, the actual error I get is the following. The way the syntax is presented in the error one could think that the SQL should be ok:

SELECT *, writetime(id) AS timestamp  FROM dls.licenses;
SyntaxException: line 1:8 mismatched input ',' expecting K_FROM (SELECT *[,]...)

Edit 2: Here is the keyspace and create statement used for this table:

CREATE KEYSPACE IF NOT EXISTS dls WITH replication = { 'class': 'SimpleStrategy', 'replication_factor': ‚1‘ };
CREATE TABLE IF NOT EXISTS dls.licenses (subscription_id text, id text, key text, data text, PRIMARY KEY (key));
CREATE INDEX IF NOT EXISTS ON dls.licenses (id);

BTW: I'm using the fresh Cassandra 4.0.0 (GA).

Upvotes: 1

Answers (3)

Erick Ramirez

Reputation: 16293

The star (*) in SELECT * is the CQL syntax for "ALL columns" so by definition, it is not possible to include another column since ALL of them are selected even for native CQL functions. For this reason, you need to enumerate all column names + functions-on-columns.

+1 to Yuki's answer. I wanted to add that DSBulk adds a WRITETIME() column for every column in the table because it isn't possible to know in advance the write-time of each column in the partition until the full partition has been read.

Allow me to explain it using a couple of examples.

Schema

Consider this table:

CREATE TABLE users_by_email (
    email text,
    name text,
    address text,
    mobile text,
    PRIMARY KEY (email)
)

Example 1

If we add a new record with a value specified for all columns:

INSERT INTO users_by_email (email, name, address, mobile)
  VALUES ('[email protected]', 'Alice', '221B Baker St', '098-765-432-109');

then for this partition, all columns will have the same write-time.

Example 2

Consider a situation where a record is fragmented across multiple inserts over a period of time such as:

INSERT INTO users_by_email (email, name)    VALUES ('[email protected]', 'Bob');

INSERT INTO users_by_email (email, address) VALUES ('[email protected]', '350 Fifth Ave');

INSERT INTO users_by_email (email, mobile)  VALUES ('[email protected]', '012-555-123-456');

Each of the columns name, address and mobile would all have different write-times.

From these 2 examples, you should see that there isn't always a single write-time that applies to all columns in the partition.

For your specific use case, you need to figure out from the DSBulk output which write-time to use for situations where the partition fragments are inserted/updated at different times. Cheers!

Upvotes: 0

yukim

Reputation: 641

If you are exporting to CSV or JSON files, you may consider using DataStax's dsbulk.

https://github.com/datastax/dsbulk

The latest version of dsbulk 1.8.0 added support to export writetime and ttl.

https://docs.datastax.com/en/dsbulk/doc/dsbulk/reference/schemaOptions.html#schemaOptions__schemaOptionsPreserveTimestamp

dsbulk unload -url myData.csv -k ks1 -t table1 --timestamp

Upvotes: 1

about_cassandra

Reputation: 1

The WHERE clause specifies which rows must be queried. It is composed of relations on the columns that are part of the PRIMARY KEY and/or have a secondary index defined on them.

The column specification of the relation must be one of the following:
One or more members of the partition key of the table
A clustering column, only if the relation is preceded by other relations that specify all columns in the partition key
A column that is indexed using CREATE INDEX.

In Cassandra 3.6 and later, add ALLOW FILTERING to filter only on a non-indexed cluster column.

You may be able to solve your query problem by creating a secondary index on the column you want the writetime for. Keep in mind secondary indexes create overhead and which may result in unintended consequences.