Improving Cassandra read time in my scenario

Question

I'm testing a single-node DataStax Cassandra 2.0 installation with the default configuration, using a client written with Astyanax.

In my scenario there is one column family (CF). Each row contains a key (a natural number converted to a string) and one column that holds 1 kB of random text data.

The client inserts rows until the data size reaches 50 GB. It does this at about 3000 req/sec, which is enough for me. The next step is to read all of this data in the same order in which it was inserted, and this is where the problems start. Here is an example log produced by my program:

reads   writes  time    req/sec
99998   0       922,59  108
100000  0       508,51  196
100000  0       294,85  339
100000  0       195,99  510
100000  0       137,11  729
100000  0       105,48  948
100000  0       105,83  944
100000  0       76,05   1314
100000  0       71,94   1389
100000  0       63,34   1578
100000  0       63,91   1564
100000  0       65,69   1522
100000  0       1217,52 82
100000  0       725,67  137
100000  0       502,03  199
100000  0       342,17  292
100000  0       336,83  296
100000  0       332,56  300
100000  0       330,27  302
100000  0       359,74  277
100000  0       320,01  312
100000  0       369,02  270
100000  0       774,47  129
100000  0       564,81  177
100000  0       729,50  137
100000  0       656,28  152
100000  0       611,29  163
100000  0       589,29  169
100000  0       693,99  144
100000  0       658,12  151
100000  0       294,53  339
100000  0       126,81  788
100000  0       206,13  485
100000  0       924,29  108

The throughput is unstable and rather low.

I'd appreciate any help that may improve read time. I can also provide more information if needed.

Thanks for help!

Kuba

psanford · Accepted Answer

I'm guessing you are doing your reads sequentially. If you do them in parallel, you should be able to do many more operations per second.
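As a rough illustration of the idea, here is a minimal Python sketch of issuing reads from a pool of worker threads instead of one at a time. It is not Astyanax code: `fetch_row` is a hypothetical stand-in for whatever blocking single-row read your client exposes, and the simulated latency and worker count are assumptions chosen only to make the example runnable.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch_row(key: str) -> bytes:
    """Hypothetical stand-in for one blocking single-row read.

    Replace the body with the real client call. The sleep only
    simulates per-request latency so the effect of overlap is visible.
    """
    time.sleep(0.001)
    # Return a fake 1 kB payload that starts with the key.
    return key.encode().ljust(1024, b"x")


def read_all(keys, workers=32):
    """Read every key with up to `workers` requests in flight at once."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in the order of `keys`, so the rows still
        # come back in insertion order even though the requests overlap.
        return list(pool.map(fetch_row, keys))


if __name__ == "__main__":
    keys = [str(i) for i in range(1000)]
    start = time.perf_counter()
    rows = read_all(keys)
    elapsed = time.perf_counter() - start
    print(f"{len(rows)} reads in {elapsed:.2f}s")
```

The same pattern applies in Java with an `ExecutorService`. The right number of in-flight requests depends on your hardware, so it is worth measuring a few values rather than trusting the 32 used here.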

Update to address single read latency:

Read latency can be affected by the following variables:

Is the row in memory (Memtable or Row cache)?
How many sstables is the row spread over?
How wide is the row?
How many columns need to be scanned past to find the column you are looking for?
Are you reading from the front or the end of the row?
Does the row have tombstones?
Are you using leveled or size-tiered compaction?
Are the sstables in the disk cache or not?
How many replicas does the coordinator need to wait for?
How many other requests is the node servicing at the same time?
Network latency
Disk latency (rotational)
Disk utilization (queue size/await) -- can be affected by compaction
Disk read-ahead size
Java GC pauses
CPU utilization -- can be affected by compactions
Context switches
Are you in swap?

There are a number of tools that can help you answer these questions, some specific to Cassandra and others general system performance tools. Look in the Cassandra logs for GC pauses and for dropped requests. Look at nodetool cfstats to see latency stats. Use nodetool cfhistograms to check latency distributions, the number of sstables hit per read, and row size distribution. Use nodetool tpstats to check for dropped requests and queue sizes.
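If you want to capture these outputs repeatedly while a test runs, a small wrapper can help. The following is only a sketch: the keyspace and column family names are hypothetical placeholders, and it assumes `nodetool` is on the PATH of the machine where it runs. It uses only the three subcommands mentioned above.

```python
import shutil
import subprocess


def nodetool_commands(keyspace, column_family):
    """Build the command lines for the nodetool checks mentioned above."""
    return [
        ["nodetool", "cfstats"],
        ["nodetool", "cfhistograms", keyspace, column_family],
        ["nodetool", "tpstats"],
    ]


def run_diagnostics(keyspace, column_family):
    """Run each command and return its stdout keyed by the command line.

    Returns an empty dict when nodetool is not installed on this machine.
    """
    if shutil.which("nodetool") is None:
        return {}
    results = {}
    for cmd in nodetool_commands(keyspace, column_family):
        proc = subprocess.run(cmd, capture_output=True, text=True)
        results[" ".join(cmd)] = proc.stdout
    return results


if __name__ == "__main__":
    # "my_keyspace" and "my_cf" are placeholder names, not from the question.
    for name, output in run_diagnostics("my_keyspace", "my_cf").items():
        print(f"== {name} ==")
        print(output)
```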

You can also use tools like iostat and vmstat to see disk and system utilization stats.
