Reputation: 79
I have a Cassandra database which contains a table of over 10B entries, with no indexes. I need to read every row and do some data grouping. I did this using Java and the Spring Boot framework, but it only scanned 2B records, which is the Cassandra limit on select * from abc.abc as documented here: https://issues.apache.org/jira/browse/CASSANDRA-14683
Is there a way in Java to do it? I tried DSBulk, but that counts the whole table and does not read each row.
Upvotes: 0
Views: 493
Reputation: 1695
It is entirely possible to process such an amount of data.
Your issue is most likely in the abstraction layer (Spring Data Cassandra). The Java drivers can easily process this amount of data. Here are a few options (which can be combined):
1/ Spark
Spark uses the token range trick. You should look at the Spark Cassandra Connector (SCC). Note that it requires Scala and a Spark runtime.
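For reference, a minimal sketch of a full scan with SCC's Java API, assuming the connector is on the classpath and a reachable contact point; the abc.abc keyspace/table comes from the question and the grouping logic is a placeholder:

import com.datastax.spark.connector.japi.CassandraJavaUtil;
import com.datastax.spark.connector.japi.CassandraRow;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class FullTableScan {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("cassandra-full-scan")
                // assumed contact point; the master is supplied by spark-submit
                .set("spark.cassandra.connection.host", "127.0.0.1");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // SCC splits the scan into token-range partitions and distributes them
            long rows = CassandraJavaUtil.javaFunctions(sc)
                    .cassandraTable("abc", "abc")  // keyspace, table
                    .map(CassandraRow::toString)   // replace with your grouping logic
                    .count();
            System.out.println("Rows scanned: " + rows);
        }
    }
}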
2/ DSBulk
It only allows copying or counting data, so it is not something you can use to process your data.
3/ Apache Beam
A not-so-famous but great way to process data in a distributed fashion. It also distributes the load among nodes, and it is Java compatible. A sample is sketched below.
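A minimal sketch using Beam's CassandraIO connector, assuming the beam-sdks-java-io-cassandra module is on the classpath; MyRow is a hypothetical POJO mapped with the DataStax object-mapper annotations that CassandraIO expects, and the host/port values are placeholders:

import java.io.Serializable;
import java.util.Collections;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.coders.SerializableCoder;
import org.apache.beam.sdk.io.cassandra.CassandraIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.TypeDescriptors;
import com.datastax.driver.mapping.annotations.Column;
import com.datastax.driver.mapping.annotations.PartitionKey;
import com.datastax.driver.mapping.annotations.Table;

public class BeamFullScan {

    @Table(keyspace = "abc", name = "abc") // hypothetical mapped entity
    public static class MyRow implements Serializable {
        @PartitionKey @Column(name = "partition_key") public String partitionKey;
        // map the remaining columns you need
    }

    public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
        p.apply(CassandraIO.<MyRow>read()
                .withHosts(Collections.singletonList("127.0.0.1")) // assumed contact point
                .withPort(9042)
                .withKeyspace("abc")
                .withTable("abc")
                .withEntity(MyRow.class)
                .withCoder(SerializableCoder.of(MyRow.class)))
         // reads are split by token range and distributed across workers
         .apply(MapElements.into(TypeDescriptors.strings())
                .via(row -> row.partitionKey)); // replace with your grouping logic
        p.run().waitUntilFinish();
    }
}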
Upvotes: 1
Reputation: 466
I'm not sure what you mean by scan, but you have two options when you want all of the data within a Cassandra table: dsbulk or Spark.
Trying to do a query like select * from foo.bar is never going to be efficient and will more than likely time out with 10B rows.
The above two tools break the query down into partition range queries, and you can set throttles to limit the number of operations so you don't hit those same timeouts.
With dsbulk, you can get a count, or you can unload the data to a CSV if you need to do that.
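For example, something along these lines (the keyspace/table names and output directory are placeholders):

dsbulk count -k abc -t abc
dsbulk unload -k abc -t abc -url /tmp/abc_export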
Upvotes: 1
Reputation: 4506
The issue you mentioned, CASSANDRA-14683, refers to a limitation where SELECT statements without a specific partition key can only scan up to 2 billion rows.
An alternative approach is to paginate over the token ring and read your data in chunks until all of it has been fetched.
Something like the following:
import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.Row;
import com.datastax.oss.driver.api.core.cql.SimpleStatement;
import com.datastax.oss.driver.api.core.metadata.TokenMap;
import com.datastax.oss.driver.api.core.metadata.token.TokenRange;

private static final int PAGE_SIZE = 10000; // rows the driver fetches per page

public void getData() {
    try (CqlSession session = CqlSession.builder().build()) {
        TokenMap tokenMap = session.getMetadata().getTokenMap().orElseThrow();
        // Visit every token range in the ring, one range per query
        for (TokenRange range : tokenMap.getTokenRanges()) {
            for (TokenRange subRange : range.unwrap()) { // split wrap-around ranges
                SimpleStatement stmt = SimpleStatement
                        .newInstance(createRangeQuery(subRange, tokenMap))
                        .setPageSize(PAGE_SIZE);
                for (Row row : session.execute(stmt)) { // driver pages transparently
                    // Process your data
                }
            }
        }
    }
}

private String createRangeQuery(TokenRange range, TokenMap tokenMap) {
    // Token ranges are (start, end]: exclusive start, inclusive end
    return "SELECT * FROM YOUR_TABLE"
            + " WHERE token(partition_key) > " + tokenMap.format(range.getStart())
            + " AND token(partition_key) <= " + tokenMap.format(range.getEnd());
}
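If a single-threaded scan is too slow, each range can be further subdivided with TokenRange.splitEvenly(n) and the sub-ranges handed to multiple threads, which is essentially what Spark and DSBulk do under the hood.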
Upvotes: 1