Augustin
Augustin

Reputation: 111

Presto Cassandra Slow performance slow

I am using presto to query Cassandra records, it is taking around 8 mins to respond the result. Need to improve the response time.

Presto configuration below:

   coordinator=true
   node-scheduler.include-coordinator=false
   http-server.http.port=8080
   query.max-memory=5GB
   query.max-memory-per-node=3GB
   discovery-server.enabled=true
   discovery.uri=http://URL:8080
   task.max-worker-threads=10
   task.concurrency=32 

   Worker : 4

   coordinator=false
   http-server.http.port=8080
   query.max-memory=5GB
   query.max-memory-per-node=2GB
   discovery.uri=http://URL:8080
   task.max-worker-threads=16
   task.concurrency=32

   Cassandra : 4 NODE 

Fragment 2 Cost: CPU 1.98m, Input: 17833912 rows (1.49GB), Output: 13089502 rows (1.31GB)
ScanFilterProject[table = cassandra:cassandra:rasapp:raslog, originalConstraint = (("bucketid" = CAST('2017062113' Cost: 96.12%, Input: 23169736 rows (22.10MB), Output: 17833912 rows (1.49GB), Filtered: 23.03%

How to improve the response time in presto still i m using partition key which has around 23 million records?

CREATE TABLE TEST.TEST_LOG (
  bucketId              varchar,
  id                    timeuuid,
  transaction_id        varchar,
  ras_transaction_id    varchar,
  msg_seq_id            int,
  host_name             varchar,
  matip_channel_id      varchar,
  hth_id                varchar,
  mq_id                 varchar,
  log_point             varchar,
  entry_time            timestamp,
  exit_time             timestamp,
  source_carrier        varchar,
  destination_carrier   varchar,
  source_dcs            varchar,
  destination_dcs       varchar,
  message_type          varchar,
  message_direction     int,
  error_code_business   varchar,
  exception_code        varchar,
  exception_description varchar,
  scenario              varchar,
  created_date          timestamp,
  huborcar              varchar,
  noof_fanout           varchar,
  flight_date           timestamp,
  route_origin          varchar,
  route_destination     varchar,
  class_service         varchar,
  no_of_seats           varchar,
  ras_host              varchar,
  cp_host               varchar,
  PRIMARY KEY(bucketid, created_date, msg_seq_id,message_direction,scenario,source_dcs,exception_code,log_point,transaction_id,id)
) WITH default_time_to_live = 2851200 and CLUSTERING ORDER BY (created_date ASC, msg_seq_id ASC,message_direction ASC,scenario ASC,source_dcs ASC,exception_code ASC,log_point ASC,transaction_id ASC,id ASC);

Query

select
transaction_id,
message_direction,
message_type,
max(exception_code) as exception_code,
min(entry_time) as min_entry,
max(entry_time) as max_entry,
min(exit_time) as min_exit,
max(exit_time) as max_exit
from TEST.TEST_LOG
where bucketid='2017062113'
and (
((msg_seq_id<=2 and message_type='PAOREQ'  ) or
( msg_seq_id>2 and message_type='PAORES'  )))
group by transaction_id,
message_direction,
message_type

Time taken : 8 mins

Thanks,

Upvotes: 1

Views: 868

Answers (1)

Christina Foley
Christina Foley

Reputation: 711

Two things: The 0.180 release of Presto will include pushdown of inequality predicates on clustering keys, which will help out your query. Also, your schema does not work well with the query that you are running. In Cassandra, it is best to a) query on particular partitions (which you do) and also to have predicates on the clustering keys in the order in which you use them (since that's the sort order that Cassandra uses). You will probably see better performance if you have a primary key of (bucketid, message_type, msg_seq_id, ...).

Additionally, Presto does not push down aggregations to Cassandra (or any connector), so if there is a large amount of data that you're aggregating, and you don't need Presto for the federated query, it may be faster to just do the query in Cassandra.

Upvotes: 0

Related Questions