Reputation: 440

How to include SQL select statement in dsbulk unload command

I have a huge orderhistory table in cassandra having data from 2013, But I want only last 12 months of orderhistory data to be unloaded, I use the below command to do it which unloads all the data starting from 2013 and stores in the path data/json/customer_data/orderhistory/data. How do I modify the below statement such that each time I run this it should select only last 12months of data?

dsbulk unload -k customer_data -t crawlsiteidentifiedpages -h '172.xx.xx.xxx' \
  -c json -url data/json/customer_data/orderhistory/data

Upvotes: 1

Answers (2)

adutra

Reputation: 4536

Instead of -t crawlsiteidentifiedpages you should use -query and provide the SELECT query, e.g.:

-query "SELECT * FROM crawlsiteidentifiedpages WHERE token(pk) > :start and token(pk) <= :end and date > maxTimeuuid('2021-06-21+0000') ALLOW FILTERING"

A few remarks:

I assume your table has one partition key column pk and one clustering column date of type timeuuid – please adjust the actual query accordingly.
The WHERE restriction token(pk) > :start and token(pk) <= :end allows DSBulk to parallelize the operation and improves performance.
The WHERE restriction date > maxTimeuuid('2021-06-21+0000') is where the magic happens and allows you to select only the last 12 months of data.
Unfortunately, you also need to add ALLOW FILTERING to this type of query, otherwise Cassandra will reject the query.

Upvotes: 1

Alex Ott

Reputation: 87184

You need to remove options -k and -t, and instead use the -query option as described in documentation, like:

dsbulk unload -query 'select * from ks.table where <your condition>'

To make sure that unload is parallelized, make sure that your condition includes part like and token(pkcol) > :start and token(pkcol) <= :end where pkcol is the name of the partition column (if you have multiple partition columns, specify them comma-separated).

Upvotes: 1

How to include SQL select statement in dsbulk unload command

Answers (2)

Related Questions