Reputation: 11
Table Structure : testkeyspace.productInfo (productID, productName, timestamp, price) - Partition key column in the above table is “productId”. - table -> testkeyspace.productInfo consists of 10k records
Test Case
How data is read from Cassandra table? (Read Pattern)
Issue
Note: Above mentioned issue is applicable even when using a numeric-based column in the WHERE clause of the select query.
It's tested and observed in the single node cluster.
Can someone explain this behaviour and let me know why this is happening?
Thanks in advance.
Upvotes: 1
Views: 93
Reputation: 16323
What you see as "data loss" is most likely inconsistency between replicas.
What is probably happening is that the nodes cannot keep up with writes and dropping mutations leading to inconsistent replicas.
You need to run repairs regularly if you aren't already. But the long term solution is to add more nodes to your cluster so the nodes can keep up with the load.
In addition, I agree with Aaron that you need to redesign your data model because the table is not a good fit for your app query.
As a side note, it doesn't make sense to use the JDBC driver when you have a Java application. You should switch to using the Cassandra Java driver so you can natively query the database with CQL.
If you want to learn how, we have a lot of free tutorials and courses at datastax.com/dev. Cheers!
Upvotes: 0
Reputation: 57748
While Step 4 is executed in parallel with the write application, few random records are missing in the ResultSet returned from the JDBC call.
Can someone explain this behaviour and let me know why this is happening?
Sure. The problem is with the queries:
Initially, "SELECT *" is issued to Cassandra table
No mention of a WHERE
clause, so I'll assume that there isn't one.
The subsequent "SELECT *" are issued with a WHERE condition of timestamp > lastNotedTimestamp
So this one does have a WHERE
clause. However, I don't see the partition key inside of it.
When you query Cassandra without specifying a partition key, there are no guarantees that the query can be served by a single node. This means (for both queries) that one node is being selected as a coordinator, to query the other nodes exhaustively. Once that node has assembled the result set (in RAM) it then returns it to the application.
Non-partition key queries cause Cassandra nodes to work very hard. Now, add into the mix that the cluster is also being asked it to handle the throughput of the write application, and I'm not surprised that you're seeing issues.
Essentially, this is happening because the current read/write patterns are too resource-intensive for your cluster to handle.
You can:
SELECT *
(without a WHERE
clause) or the ALLOW FILTERING
directive.Edit 20230518
i tried with the partition key column in the where clause like "SELECT * FROM keyspace.table where partitionkeyColumn > latestPartitionColumnObservedPosition ALLOW FILTERING". I observed few record missing randomly for this case also.
Yes, because that query is still causing stress on all of the nodes. The partition key needs to be specified with an equality operator. Sure, it's getting a partition key. But all of the partition keys greater than latestPartitionColumnObservedPosition
will still be spread across multiple nodes, so nothing is improving.
Given the table definition provided above, this table can support one performant query:
SELECT * FROM productInfo WHERE productId=?;
That's it. If there's any other query running on that table, it's going to be hitting multiple nodes and causing them to work hard.
If the concern is about the timestamp
, then you might try building a new table with a "bucketing" approach, like this:
CREATE TABLE productinfo_by_day_bucket (
productid TEXT,
daybucket INT,
productname TEXT,
producttimestamp TIMESTAMP,
price DECIMAL,
PRIMARY KEY (daybucket, producttimestamp, productid)
WITH CLUSTERING ORDER BY (producttimestamp DESC, productid ASC);
This will store products updated for a particular day in a single partition, meaning that they can be queried like:
SELECT * FROM productinfo_by_day_bucket
WHERE daybucket=20230518
AND producttimestamp > '2023-05-18 14:10:05.705000+0000';
Depending on how many products are updated in a particular day, that "time bucket" may need to be a different unit of measure. But that's up to the business requirements. This also allows filtering on timestamp, as long as it's within a partition.
Note that producttimestamp
is specified as a clustering key, sorting in descending order. This is because most time-based application queries tend to care about the most-recent data.
Sorting in descending order allows queries to quickly pull records from the "top" of the partition, which will be faster.
productid
is the last clustering key, but that's really just there as a tie-breaker to ensure uniqueness.
You should check out DataStax Academy, especially DS220 which is our free class on Cassandra data modeling. It really does a good job of explaining how everything works under-the-hood, and gives you data modeling and query building strategies to leverage that.
Upvotes: 1