Reputation: 447

Data is storing unorganized in Datastax Astra DB

I have uploaded my csv file in my Datastax Astra DB. The thing is why the data when it is uploaded, the rows gets unorganized or random. Say if in my csv the row id is like 1, 2, 3, ... Then when I am uploading this to my DB, it gets uploaded scrambled/rearranged say like row id 2, 1, 4, 5, 3 ... Due to this reason when I am requesting the data via http_methods in python, the data is also coming as it is stored i.e. unorganized way. Why is this happening ? And is there a way to solve this ?

Upvotes: 1

Answers (2)

Aaron

Reputation: 57798

Why is this happening ?

dwettlaufer's answer is spot-on, but I'll provide a visual representation to help with understanding just what's happening here:

> SELECT id,token(id),name FROM numbers ;

 id | system.token(id)     | name
----+----------------------+-------
  5 | -7509452495886106294 |  five
  1 | -4069959284402364209 |   one
  2 | -3248873570005575792 |   two
  4 | -2729420104000364805 |  four
  3 |  9010454139840013625 | three

(5 rows)

Data in Astra DB is stored in order by the hashed token value of the partition key. Because of the hash, numeric values are quite likely to be arranged in a different order.

And is there a way to solve this ?

As was mentioned, sort order can only be enforced within a partition key. So if a different column is used as the partition key, id can be used to sort data within it.

Consider a table that looks like this:

CREATE TABLE dept_numbers (
    dept int,
    id int,
    name TEXT,
PRIMARY KEY(dept,id));

Then something like this would work:

> SELECT * FROM dept_numbers WHERE dept=1;

 dept | id | name
------+----+-------
    1 |  1 |   one
    1 |  2 |   two
    1 |  3 | three
    1 |  4 |  four
    1 |  5 |  five

(5 rows)

Note that all queries with Astra DB should have a WHERE clause. Without it, an expensive table scan will result.

Upvotes: 2

dwettlaufer

Reputation: 378

What it comes down to is how Cassandra stores your data. I’m assuming that you’re only defining a partition key within the primary key. The partition key is responsible for how the data is stored on the nodes in your cluster. If you want your data in a particular order then you’ll need to define the second part of the primary key, the clustering key.

I’d suggest taking a look at this answer which gives an excellent explanation of ordering https://stackoverflow.com/a/35708544/13078421

Upvotes: 2

Data is storing unorganized in Datastax Astra DB

Answers (2)

Related Questions