Reputation: 111
I am using presto to query Cassandra records, it is taking around 8 mins to respond the result. Need to improve the response time.
Presto configuration below:
coordinator=true
node-scheduler.include-coordinator=false
http-server.http.port=8080
query.max-memory=5GB
query.max-memory-per-node=3GB
discovery-server.enabled=true
discovery.uri=http://URL:8080
task.max-worker-threads=10
task.concurrency=32
Worker : 4
coordinator=false
http-server.http.port=8080
query.max-memory=5GB
query.max-memory-per-node=2GB
discovery.uri=http://URL:8080
task.max-worker-threads=16
task.concurrency=32
Cassandra : 4 NODE
Fragment 2
Cost: CPU 1.98m, Input: 17833912 rows (1.49GB), Output: 13089502 rows (1.31GB)
ScanFilterProject[table = cassandra:cassandra:rasapp:raslog, originalConstraint = (("bucketid" = CAST('2017062113'
Cost: 96.12%, Input: 23169736 rows (22.10MB), Output: 17833912 rows (1.49GB), Filtered: 23.03%
How to improve the response time in presto still i m using partition key which has around 23 million records?
CREATE TABLE TEST.TEST_LOG (
bucketId varchar,
id timeuuid,
transaction_id varchar,
ras_transaction_id varchar,
msg_seq_id int,
host_name varchar,
matip_channel_id varchar,
hth_id varchar,
mq_id varchar,
log_point varchar,
entry_time timestamp,
exit_time timestamp,
source_carrier varchar,
destination_carrier varchar,
source_dcs varchar,
destination_dcs varchar,
message_type varchar,
message_direction int,
error_code_business varchar,
exception_code varchar,
exception_description varchar,
scenario varchar,
created_date timestamp,
huborcar varchar,
noof_fanout varchar,
flight_date timestamp,
route_origin varchar,
route_destination varchar,
class_service varchar,
no_of_seats varchar,
ras_host varchar,
cp_host varchar,
PRIMARY KEY(bucketid, created_date, msg_seq_id,message_direction,scenario,source_dcs,exception_code,log_point,transaction_id,id)
) WITH default_time_to_live = 2851200 and CLUSTERING ORDER BY (created_date ASC, msg_seq_id ASC,message_direction ASC,scenario ASC,source_dcs ASC,exception_code ASC,log_point ASC,transaction_id ASC,id ASC);
Query
select
transaction_id,
message_direction,
message_type,
max(exception_code) as exception_code,
min(entry_time) as min_entry,
max(entry_time) as max_entry,
min(exit_time) as min_exit,
max(exit_time) as max_exit
from TEST.TEST_LOG
where bucketid='2017062113'
and (
((msg_seq_id<=2 and message_type='PAOREQ' ) or
( msg_seq_id>2 and message_type='PAORES' )))
group by transaction_id,
message_direction,
message_type
Time taken : 8 mins
Thanks,
Upvotes: 1
Views: 868
Reputation: 711
Two things: The 0.180 release of Presto will include pushdown of inequality predicates on clustering keys, which will help out your query. Also, your schema does not work well with the query that you are running. In Cassandra, it is best to a) query on particular partitions (which you do) and also to have predicates on the clustering keys in the order in which you use them (since that's the sort order that Cassandra uses). You will probably see better performance if you have a primary key of (bucketid, message_type, msg_seq_id, ...).
Additionally, Presto does not push down aggregations to Cassandra (or any connector), so if there is a large amount of data that you're aggregating, and you don't need Presto for the federated query, it may be faster to just do the query in Cassandra.
Upvotes: 0