Reputation: 21
I have a Cassandra 2.x database in which I need to select data from a table by a non-primary-key column, which is the primary key of another table, using a simple WHERE clause. The data is used to populate a cache. The problem is that retrieval is too slow and times out with the DataStax 3.x driver. Is there any way to fetch the data without upgrading the database software or altering the existing database structure? I tried asynchronous fetching and pagination through the DataStax API, but it still couldn't cope with the volume of data, and the query fails.
Upvotes: 2
Views: 1718
Reputation: 87069
Cassandra is heavily optimized for accessing data by primary key: the full key, a partial key, or at least the partition key. Other access patterns require additional work. In theory you can create a secondary index on the corresponding column, but that is only recommended when you query it together with at least the partition key. If you filter on that column alone, the query still has to reach all nodes and fetch all the data, so it will be much slower. You'll also need to keep in mind other limitations, such as the cardinality of the column (you can read about that here).
Programmatically, you can also do a full scan of the data, but it shouldn't be a simple select * from table, as that will overload the coordinator node, lead to timeouts, etc. A more sophisticated approach is needed: read the data from individual token ranges, sending each query to the nodes that own the corresponding range. These queries can run in parallel. This is how the Spark Cassandra Connector and DSBulk work (you could try adapting pieces of the DSBulk code for this task, since it can be used as a library). I also have an example of how to perform a full table scan using the Java driver. You can adapt that code and replace the simple counting with your filtering condition.
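To illustrate the token-range idea, here is a minimal sketch in plain Java with no driver dependency. It is not the linked example or the DSBulk implementation. It assumes the default Murmur3Partitioner (tokens covering the signed 64-bit range) and uses hypothetical names: the table `my_keyspace.my_table` and the partition key column `id`. It splits the token space evenly and builds one CQL query per slice. In a real scan you would more likely take the ranges from the driver's cluster metadata so that each query goes to a node owning that range, then apply your filter on the non-key column client-side.

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class TokenRangeScan {
    // Assumption: Murmur3Partitioner, whose tokens fit in a signed 64-bit long.
    static final BigInteger MIN = BigInteger.valueOf(Long.MIN_VALUE);
    static final BigInteger MAX = BigInteger.valueOf(Long.MAX_VALUE);

    /** Splits [MIN, MAX] into `parts` contiguous, non-overlapping inclusive ranges. */
    static List<long[]> split(int parts) {
        if (parts < 1) {
            throw new IllegalArgumentException("parts must be >= 1");
        }
        // BigInteger avoids overflow: the full range holds 2^64 tokens.
        BigInteger total = MAX.subtract(MIN).add(BigInteger.ONE);
        BigInteger step = total.divide(BigInteger.valueOf(parts));
        List<long[]> ranges = new ArrayList<>();
        BigInteger start = MIN;
        for (int i = 0; i < parts; i++) {
            // The last slice absorbs any remainder left by the integer division.
            BigInteger end = (i == parts - 1) ? MAX : start.add(step).subtract(BigInteger.ONE);
            ranges.add(new long[] {start.longValueExact(), end.longValueExact()});
            start = end.add(BigInteger.ONE);
        }
        return ranges;
    }

    /** Builds the CQL for one slice; table and partitionKey are caller-supplied names. */
    static String rangeQuery(String table, String partitionKey, long[] range) {
        return String.format(Locale.ROOT,
                "SELECT * FROM %s WHERE token(%s) >= %d AND token(%s) <= %d",
                table, partitionKey, range[0], partitionKey, range[1]);
    }

    public static void main(String[] args) {
        // Hypothetical table and key names; each query could be submitted in parallel.
        for (long[] range : split(4)) {
            System.out.println(rangeQuery("my_keyspace.my_table", "id", range));
        }
    }
}
```

Each generated query touches only a slice of the ring, so no single request has to stream the whole table through one coordinator. The number of slices is a tuning knob: more slices mean smaller, faster queries at the cost of more round trips.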
Upvotes: 3