Reputation: 2084
I loaded almost 190 million records into a Cassandra (2.1.11) cluster with 3 nodes and a replication factor of 1. I then wrote a client application to count all the records using DataStax's Java Driver; the code snippet is as follows:
// Page through the table 10,000 rows at a time instead of
// materializing the whole result set in memory.
Statement stmt = new SimpleStatement("select * from test");
stmt.setFetchSize(10000);
System.out.println("starting to read records");
ResultSet rs = session.execute(stmt);
// Note: rs.all() would load every row into memory at once, so it is avoided here.
long cntRecords = 0;
for (Row row : rs) {
    cntRecords++;
    if (cntRecords % 10000000 == 0) {
        System.out.println("read " + cntRecords / 10000000 + " x 10 million records");
    }
}
Once the variable cntRecords exceeds about 30 million, I always get this exception:
Exception in thread "main" com.datastax.driver.core.exceptions.ReadTimeoutException:
Cassandra timeout during read query at consistency ONE (1 responses were required but only
0 replica responded)
I found several suggestions on Google and changed the heap and GC settings; the relevant settings are as follows:
-XX:InitialHeapSize=17179869184
-XX:MaxHeapSize=17179869184
-XX:MaxNewSize=12884901888
-XX:MaxTenuringThreshold=1
-XX:NewSize=12884901888
-XX:+UseCMSInitiatingOccupancyOnly
-XX:+UseCompressedOops
-XX:+UseConcMarkSweepGC
-XX:+UseCondCardMark
-XX:+UseGCLogFileRotation
-XX:+UseParNewGC
-XX:+UseTLAB
-XX:+UseThreadPriorities
-XX:+CMSClassUnloadingEnabled
I also used GCViewer to analyze the GC log files; the throughputs are 99.95%, 98.15%, and 95.75%.
UPDATE:
I used jstat to monitor one of the three nodes and found that once the S1 value reaches 100.00, I get the above error soon afterwards:
/usr/java/jdk1.7.0_80/bin/jstat -gcutil 8862 1000
S0 S1 E O P YGC YGCT FGC FGCT GCT
0.00 100.00 28.57 36.29 74.66 55 14.612 2 0.164 14.776
And once S1 reaches 100.00, it never decreases. Is this related to the error? Which property in cassandra.yaml or cassandra-env.sh should I set for this? What should I do to finish the task of counting all the records? Thanks in advance!
ATTACHMENT: here are the other JVM options:
-XX:+CMSEdenChunksRecordAlways
-XX:CMSInitiatingOccupancyFraction=75
-XX:+CMSParallelInitialMarkEnabled
-XX:+CMSParallelRemarkEnabled
-XX:CMSWaitDuration=10000
-XX:CompileCommandFile=bin/../conf/hotspot_compiler
-XX:GCLogFileSize=94371840
-XX:+HeapDumpOnOutOfMemoryError
-XX:NumberOfGCLogFiles=90
-XX:OldPLABSize=16
-XX:PrintFLSStatistics=1
-XX:+PrintGC
-XX:+PrintGCApplicationStoppedTime
-XX:+PrintGCDateStamps
-XX:+PrintGCDetails
-XX:+PrintGCTimeStamps
-XX:+PrintHeapAtGC
-XX:+PrintPromotionFailure
-XX:+PrintTenuringDistribution
-XX:StringTableSize=1000003
-XX:SurvivorRatio=8
-XX:ThreadPriorityPolicy=42
-XX:ThreadStackSize=256
Upvotes: 1
Views: 2102
Reputation: 39182
Examine why you need to know the number of rows. Does your application really need to know this? If it can survive with "just" a good approximation, then create a counter and increment it as you load your data.
http://docs.datastax.com/en/cql/3.1/cql/cql_using/use_counter_t.html
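As a rough illustration, here is what the counter approach might look like with the same Java driver; the keyspace, table, and column names (ks, row_count, cnt) are hypothetical:

// Counter tables may hold only counter columns outside the primary key.
// All names below (ks, row_count, cnt) are hypothetical.
session.execute("CREATE TABLE IF NOT EXISTS ks.row_count ("
        + "table_name text PRIMARY KEY, cnt counter)");

// While loading data, bump the counter once per inserted record.
session.execute("UPDATE ks.row_count SET cnt = cnt + 1 WHERE table_name = 'test'");

// Reading the count back is a single-row lookup instead of a full table scan.
Row r = session.execute("SELECT cnt FROM ks.row_count WHERE table_name = 'test'").one();
System.out.println("approximate row count: " + r.getLong("cnt"));

Counter updates are not idempotent when retried, which is why this gives a good approximation rather than a guaranteed exact count.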
Things you can try:

* Select a single column instead of select *. This might reduce GC pressure and network consumption. Preferably pick a column that has a small number of bytes and is part of the primary key: select column1 from test (a sketch follows after this list).
* Edit cassandra.yaml on your nodes and increase range_request_timeout_in_ms and read_request_timeout_in_ms (an example follows below).
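A minimal sketch of the single-column scan, reusing your own loop; column1 stands in for whichever small primary-key column you pick:

// Same paged scan, but transferring only one small column per row.
Statement stmt = new SimpleStatement("select column1 from test");
stmt.setFetchSize(10000);

long cntRecords = 0;
for (Row row : session.execute(stmt)) {
    cntRecords++;
}
System.out.println("total records: " + cntRecords);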
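For the timeouts, an illustrative cassandra.yaml change (the values are examples to tune, not recommendations; the 2.1 defaults are 10000 ms and 5000 ms respectively):

# Illustrative values only -- tune to your cluster and workload.
range_request_timeout_in_ms: 30000   # applies to range scans such as "select * from test"
read_request_timeout_in_ms: 15000    # applies to single-partition reads

Keep in mind the Java driver also has its own client-side read timeout (SocketOptions#setReadTimeoutMillis, 12000 ms by default), which should be raised above the server-side values, or the client will abandon the request first.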
Upvotes: 2