Reputation: 440
I have a huge orderhistory table in cassandra having data from 2013, But I want only last 12 months of orderhistory data to be unloaded, I use the below command to do it which unloads all the data starting from 2013 and stores in the path data/json/customer_data/orderhistory/data
. How do I modify the below statement such that each time I run this it should select only last 12months of data?
dsbulk unload -k customer_data -t crawlsiteidentifiedpages -h '172.xx.xx.xxx' \
-c json -url data/json/customer_data/orderhistory/data
Upvotes: 1
Views: 1651
Reputation: 4536
Instead of -t crawlsiteidentifiedpages
you should use -query
and provide the SELECT query, e.g.:
-query "SELECT * FROM crawlsiteidentifiedpages WHERE token(pk) > :start and token(pk) <= :end and date > maxTimeuuid('2021-06-21+0000') ALLOW FILTERING"
A few remarks:
pk
and one clustering column date
of type timeuuid
– please adjust the actual query accordingly.token(pk) > :start and token(pk) <= :end
allows DSBulk to parallelize the operation and improves performance.date > maxTimeuuid('2021-06-21+0000')
is where the magic happens and allows you to select only the last 12 months of data.ALLOW FILTERING
to this type of query, otherwise Cassandra will reject the query.Upvotes: 1
Reputation: 87184
You need to remove options -k
and -t
, and instead use the -query
option as described in documentation, like:
dsbulk unload -query 'select * from ks.table where <your condition>'
To make sure that unload is parallelized, make sure that your condition includes part like and token(pkcol) > :start and token(pkcol) <= :end
where pkcol
is the name of the partition column (if you have multiple partition columns, specify them comma-separated).
Upvotes: 1