Anish

Reputation: 79

How can I scan an entire Cassandra table that has 10B entries and no indexes?

I have a Cassandra database that contains a table with over 10B entries and no indexes. I need to read every row and do some data grouping. I implemented this with Java and the Spring Boot framework, but it only scanned 2B records, which is the Cassandra limit on select * from abc.abc as documented here: https://issues.apache.org/jira/browse/CASSANDRA-14683. Is there a way to do this in Java? I tried DSBulk, but that only counts the whole table and does not read each row.

Upvotes: 0

Views: 493

Answers (3)

clunven

Reputation: 1695

It is totally possible to process such an amount of data.

Your issue is likely related to the abstraction layer (Spring Data Cassandra). The Java drivers can easily process this amount of data; here are the main options:

1/ Spark

Spark uses the token-range trick. Look at the Spark Cassandra Connector (SCC). Keep in mind this means Scala code and a Spark runtime.

2/ DSBulk

It only allows copying or counting data, so it is not something you can use to process your data.

3/ Apache Beam

A not-so-famous but great way to process data in a distributed fashion. It spreads the load across the nodes as well and is Java compatible; a sample sketch is shown below.
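A minimal sketch of such a Beam pipeline, assuming the beam-sdks-java-io-cassandra CassandraIO connector, a 127.0.0.1:9042 contact point, and a hypothetical AbcRow entity mapped to the abc.abc table from the question; swap Count.globally() for your own grouping transforms:

import java.io.Serializable;
import java.util.Collections;

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.coders.SerializableCoder;
import org.apache.beam.sdk.io.cassandra.CassandraIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;

import com.datastax.driver.mapping.annotations.Column;
import com.datastax.driver.mapping.annotations.PartitionKey;
import com.datastax.driver.mapping.annotations.Table;

public class ScanAbcWithBeam {

    // Hypothetical entity for abc.abc; adjust the columns and types to your real schema.
    @Table(keyspace = "abc", name = "abc")
    public static class AbcRow implements Serializable {
        @PartitionKey
        @Column(name = "id")
        private String id;

        public String getId() { return id; }
        public void setId(String id) { this.id = id; }
    }

    public static void main(String[] args) {
        Pipeline pipeline = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        pipeline
            // CassandraIO splits the read by token range and distributes it across workers.
            .apply("ReadFromCassandra", CassandraIO.<AbcRow>read()
                .withHosts(Collections.singletonList("127.0.0.1")) // contact point(s)
                .withPort(9042)
                .withKeyspace("abc")
                .withTable("abc")
                .withEntity(AbcRow.class)
                .withCoder(SerializableCoder.of(AbcRow.class)))
            // Replace this with your own grouping/aggregation transforms and a sink.
            .apply("CountRows", Count.<AbcRow>globally());

        pipeline.run().waitUntilFinish();
    }
}

Run it on a distributed runner (Spark, Flink, Dataflow) to actually spread the scan across machines; the DirectRunner would process everything on a single box.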

Upvotes: 1

stevenlacerda

Reputation: 466

I'm not sure what you mean by scan, but you have two options when you want all of the data within a Cassandra table:

1. dsbulk
2. spark

Trying to run a query like select * from foo.bar is never going to be efficient and will more than likely time out with 10B rows.

The two tools above break the query down into partition (token) range queries, and you can set throttles to limit the number of operations so you don't hit those same timeouts; a rough sketch of that idea with the plain Java driver follows below.
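To illustrate at the driver level what those tools do for you, here is a minimal sketch with the DataStax Java driver 4.x: the ranges come from the driver's TokenMap, foo.bar and partition_key are placeholders for your table and key, and the fixed-size thread pool is an assumed stand-in for a real throttle:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.Row;
import com.datastax.oss.driver.api.core.cql.SimpleStatement;
import com.datastax.oss.driver.api.core.metadata.TokenMap;
import com.datastax.oss.driver.api.core.metadata.token.TokenRange;

public class ThrottledFullScan {

    public static void main(String[] args) throws InterruptedException {
        // Throttle: at most 8 range queries run at the same time (tune to your cluster).
        ExecutorService pool = Executors.newFixedThreadPool(8);

        try (CqlSession session = CqlSession.builder().build()) {
            TokenMap tokenMap = session.getMetadata().getTokenMap().get();

            // Collect every non-wrapping sub-range of the token ring.
            List<TokenRange> ranges = new ArrayList<>();
            for (TokenRange range : tokenMap.getTokenRanges()) {
                ranges.addAll(range.unwrap());
            }

            for (TokenRange range : ranges) {
                pool.submit(() -> {
                    String cql = "SELECT * FROM foo.bar"
                            + " WHERE token(partition_key) > " + tokenMap.format(range.getStart())
                            + " AND token(partition_key) <= " + tokenMap.format(range.getEnd());
                    for (Row row : session.execute(SimpleStatement.newInstance(cql).setPageSize(5000))) {
                        // feed your grouping logic here
                    }
                });
            }

            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.DAYS); // wait until every range has been scanned
        }
    }
}

dsbulk and the Spark connector do essentially this for you, with smarter splitting and rate limiting built in.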

With dsbulk, you can get a count, or you can unload the data to a CSV if that is what you need.

Upvotes: 1

Sangam Belose

Reputation: 4506

The issue you mentioned, CASSANDRA-14683, refers to a limitation where SELECT statements without a specific partition key can only scan up to 2 billion rows.

An alternative approach is to paginate using token ranges: iterate over the cluster's token ranges and read each range in chunks until all the data has been fetched. With the Java driver 4.x you can get the ranges from the session's TokenMap and let the driver's built-in paging (setPageSize) walk through each range; a LIMIT clause would only truncate the result.

Something like below:

private static final int PAGE_SIZE = 10000; // Number of rows the driver fetches per page

public void getData() {
    try (CqlSession session = CqlSession.builder().build()) {
        TokenMap tokenMap = session.getMetadata().getTokenMap().get();
        // Scan the ring one token range at a time instead of a single unbounded SELECT
        for (TokenRange range : tokenMap.getTokenRanges()) {
            for (TokenRange subRange : range.unwrap()) { // split ranges that wrap around the ring
                SimpleStatement statement = SimpleStatement
                        .newInstance(createRangeQuery(tokenMap, subRange))
                        .setPageSize(PAGE_SIZE); // the driver pages through the range transparently
                for (Row row : session.execute(statement)) {
                    // Process your data
                }
            }
        }
    }
}

private String createRangeQuery(TokenMap tokenMap, TokenRange range) {
    // Driver token ranges are ]start, end]: exclusive start, inclusive end
    return "SELECT * FROM YOUR_TABLE WHERE token(partition_key) > " + tokenMap.format(range.getStart())
            + " AND token(partition_key) <= " + tokenMap.format(range.getEnd());
}

Upvotes: 1
