Vučko

Reputation: 1

Spring Data Cassandra driver gets stuck after a few hours, with a single-node database on the same node

I've been having problems with Apache Cassandra database access via spring-data-cassandra:

The application is a small Spring Boot (1.4.0) server application using Spring Data Cassandra (I tried 1.4.2 and 1.4.4). It collects data from remote clients and implements an administrative GUI based on a REST interface on the server side, including a dashboard recomputed every 10 seconds by Spring @Scheduled tasks and delivered to clients (browsers) over the WebSocket protocol. Traffic is secured with HTTPS and mutual authentication (server and client certificates).
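For context, the dashboard refresh is just a plain Spring scheduled task pushing over STOMP/WebSocket. A minimal sketch of the idea, assuming @EnableScheduling and a STOMP broker are configured elsewhere (the class name, destination and statistics method are made up for illustration, not taken from the actual application):

import org.springframework.messaging.simp.SimpMessagingTemplate;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

import java.util.Collections;
import java.util.Map;

@Component
public class DashboardPublisher {

    private final SimpMessagingTemplate messagingTemplate;

    public DashboardPublisher(SimpMessagingTemplate messagingTemplate) {
        this.messagingTemplate = messagingTemplate;
    }

    // Recompute the dashboard every 10 seconds and push it to browsers over STOMP/WebSocket.
    @Scheduled(fixedRate = 10000)
    public void publishDashboard() {
        Map<String, Object> dashboard = computeLastHourStatistics();
        messagingTemplate.convertAndSend("/topic/dashboard", dashboard);
    }

    private Map<String, Object> computeLastHourStatistics() {
        // Placeholder: in the real application this queries Cassandra for the last hour of traffic.
        return Collections.emptyMap();
    }
}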

The current state of the application is being tested against a Cassandra 2.2.8 database running on the same cloud server (connected via the loopback address 127.0.0.1) under Ubuntu 14.04. A couple of test clients generate a load of around 300k database records per hour (50k master and 5x50k detail records), uploading data every 5 seconds or so. The dashboard trawls through the last hour of traffic and produces statistics. Average CPU usage reported by the sar utility is around 10%. The current database size is around 25 GB.

Data is inserted in small batches; a rough sketch follows below. I have also tried individual writes, but the problem did not disappear; CPU usage just increased by around 50% while testing with single writes.
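Roughly, the batched writes are done along these lines with the DataStax driver; the keyspace, table and column names below are placeholders rather than the real schema:

import java.util.List;
import java.util.UUID;

import com.datastax.driver.core.BatchStatement;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Session;

public class DetailWriter {

    private final Session session;
    private final PreparedStatement insertDetail;

    public DetailWriter(Session session) {
        this.session = session;
        // Prepare once and reuse the statement for every batch.
        this.insertDetail = session.prepare(
                "INSERT INTO mykeyspace.detail_record (master_id, seq, payload) VALUES (?, ?, ?)");
    }

    // One client upload (a handful of detail rows) ends up as one small unlogged batch.
    public void writeDetails(UUID masterId, List<String> payloads) {
        BatchStatement batch = new BatchStatement(BatchStatement.Type.UNLOGGED);
        int seq = 0;
        for (String payload : payloads) {
            batch.add(insertDetail.bind(masterId, seq++, payload));
        }
        session.execute(batch);
    }
}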

I've done a lot of Google "research" on the topic and found nothing specific, but I tried quite a few pieces of advice, e.g. putting the schema name in all queries (see the fragment below) and a couple of configuration options, with apparently no effect on the final outcome (a blocked server needing a restart). The server has run for up to 30 hours or so, but sometimes gets blocked within 1-2 hours; it usually runs 7-10 hours before the driver gets stuck, with no obvious pattern in the running period.
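For the "schema name in all queries" advice, in repository terms that amounts to something like the fragment below; the keyspace, table and the MasterRecord entity are placeholders for illustration only:

import java.util.List;
import java.util.UUID;

import org.springframework.data.cassandra.repository.CassandraRepository;
import org.springframework.data.cassandra.repository.Query;

public interface MasterRecordRepository extends CassandraRepository<MasterRecord> {

    // The table is explicitly qualified with the keyspace instead of relying on the session's logged keyspace.
    @Query("SELECT * FROM mykeyspace.master_record WHERE client_id = ?0")
    List<MasterRecord> findByClientId(UUID clientId);
}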

I've been monitoring the heap and there is nothing in particular to see: no data structures piling up over time. The server is run with -Xms2g -Xmx3g -XX:+PrintGCDetails.

The error eventually appearing is:

Caused by: com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: inpresec-cassandra/127.0.1.1:9042 (com.datastax.driver.core.OperationTimedOutException: [inpresec-cassandra/127.0.1.1:9042] Operation timed out))
        at com.datastax.driver.core.RequestHandler.reportNoMoreHosts(RequestHandler.java:217) ~[cassandra-driver-core-2.1.9.jar!/:na]
        at com.datastax.driver.core.RequestHandler.access$1000(RequestHandler.java:44) ~[cassandra-driver-core-2.1.9.jar!/:na]
        at com.datastax.driver.core.RequestHandler$SpeculativeExecution.sendRequest(RequestHandler.java:276) ~[cassandra-driver-core-2.1.9.jar!/:na]
        at com.datastax.driver.core.RequestHandler$SpeculativeExecution$1.run(RequestHandler.java:374) ~[cassandra-driver-core-2.1.9.jar!/:na]
        ... 3 common frames omitted

I have also noticed that the Cassandra process reports a virtual memory size roughly matching the size of the database. I first noticed this when the database was around 12 GB, and it has tracked the database size faithfully since; I'm not sure whether this has anything to do with the server problem. The resident set of the database process is between 2 and 3 GB, and the resident set of the server is typically 1.5-2.5 GB. Total memory of the cloud server is 8 GB.

Before running Cassandra directly in the host VM's OS, I ran it in Docker and had the same problem; moving it out of Docker was done to exclude Docker from the "list of suspects".

If anybody has experienced anything similar, I'd appreciate information or advice.

Upvotes: 0

Views: 879

Answers (1)

Vučko

Reputation: 1

The problem has apparently been solved by upgrading Netty and adding the native epoll transport so that it is used instead of the default fallback to NIO. Originally, pom.xml contained:

<dependency>
    <groupId>io.netty</groupId>
    <artifactId>netty-all</artifactId>
    <version>4.0.9.Final</version>
</dependency>

Now this has been changed to:

<dependency>
    <groupId>io.netty</groupId>
    <artifactId>netty-all</artifactId>
    <version>4.0.29.Final</version>
</dependency>

<dependency>
    <groupId>io.netty</groupId>
    <artifactId>netty-transport-native-epoll</artifactId>
    <version>4.0.29.Final</version>
    <!-- Explicitly bring in the linux classifier, test may fail on 32-bit linux -->
    <classifier>linux-x86_64</classifier>
    <scope>test</scope>
</dependency>

adding the second dependency to explicitly include epoll support, as suggested here.

After this change, the original message appearing in the log file:

com.datastax.driver.core.NettyUtil       : Did not find Netty's native epoll transport in the classpath, defaulting to NIO.

has changed into:

com.datastax.driver.core.NettyUtil       : Found Netty's native epoll transport in the classpath, using it

Since then there have been no random failures. I tried "killing" the DB connection several times by issuing extra-large queries; it dutifully reported a memory error and then recovered.
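Beyond the log line above, a quick way to double-check that the native transport can actually load (assuming Netty 4.0.x and the epoll artifact being available on the runtime classpath) is to ask Netty directly; this is only a verification snippet, not part of the fix:

import io.netty.channel.epoll.Epoll;

public class EpollCheck {
    public static void main(String[] args) {
        // Prints true only when netty-transport-native-epoll (with the right classifier)
        // is on the classpath and the native library loads on this platform.
        System.out.println("epoll available: " + Epoll.isAvailable());
        if (!Epoll.isAvailable()) {
            // Explains why the native transport could not be loaded.
            Epoll.unavailabilityCause().printStackTrace();
        }
    }
}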

Upvotes: 0
