Dmitry Pugachev

Reputation: 467

Attempt to achieve high throughput in Hyperledger Fabric network

The Hyperledger community, in the paper Hyperledger Fabric: A Distributed Operating System for Permissioned Blockchains, shows that Fabric achieves end-to-end throughput of more than 3,500 transactions per second in certain popular deployment configurations. I'm trying to achieve this result in my project, but I'm far from it. Here I report my first load-testing results and invite you to join the investigation of how to achieve high throughput with Hyperledger Fabric and Composer.

Project description

We are building a high-load service that uses Hyperledger Fabric. Our backend consists of an HF blockchain network, several microservices (Node.js) that communicate with the blockchain via Hyperledger Composer, and a message broker for communication between the microservices.

Hyperledger Fabric v1.1, Hyperledger Composer v0.19.0.
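For context, each microservice holds one long-lived Composer connection and runs its ledger queries through it, roughly along the lines of the sketch below. The business-network name and the query name are placeholders, not our real ones; only the identity name 'txBuilder' is taken from our setup.

    'use strict';

    // Rough sketch of the permanent Composer connection (composer-client v0.19).
    // 'txBuilder@load-test-network' and 'selectAllParticipants' are placeholders.
    const { BusinessNetworkConnection } = require('composer-client');

    const connection = new BusinessNetworkConnection();
    let connected = false;

    // Connect once with an existing identity card and reuse the connection
    async function getConnection() {
      if (!connected) {
        await connection.connect('txBuilder@load-test-network');
        connected = true;
      }
      return connection;
    }

    // Run a named query (defined in the business network's queries.qry file)
    async function queryLedger() {
      const conn = await getConnection();
      return conn.query('selectAllParticipants');
    }

    module.exports = { queryLedger };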

Fabric network (deployed with Cello):

{
    fabric001: {
      cas: [],
      peers: ["[email protected]"],
      orderers: ["orderer1st.orderer"],
      zookeepers: ["zookeeper1st"],
      kafkas: ["kafka1st"]
    },
    fabric002: {
      cas: [],
      peers: ["[email protected]"],
      orderers: ["orderer2nd.orderer"],
      zookeepers: ["zookeeper2nd"],
      kafkas: ["kafka2nd"]
    },
    fabric003: {
      cas: [],
      peers: ["[email protected]"],
      orderers: ["orderer3rd.orderer"],
      zookeepers: ["zookeeper3rd"],
      kafkas: ["kafka3rd"]
    },
    fabric004: {
      cas: ["ca1st.main"],
      peers: [],
      orderers: ["orderer4th.orderer"],
      zookeepers: ["zookeeper4th"],
      kafkas: ["kafka4th"]
    }
}

fabric001-004 are AWS EC2 instances of the t2.xlarge type. Initially I used m5.4xlarge, but it costs a lot, and CPU usage was always low even when Fabric started to fail.

Fabric config:

BatchTimeout: 0.2s
BatchSize:
    MaxMessageCount: 10
    AbsoluteMaxBytes: 98 MB
    PreferredMaxBytes: 512 KB

TLS disabled.

If required, I can perform new tests with any configuration.


Load testing

First of all, I decided to test queries against the ledger state (CouchDB). The blockchain is empty: only system data and a few participants. Direct query requests to the exposed CouchDB port are very fast (~150 ms). My microservice connects to Fabric by establishing a permanent connection for an existing identity. Without load, requests take ~500 ms in our system; half of this time is spent in the message broker (AWS SQS is really slow). For load testing I'm using Yandex.Tank. The load goes smoothly, without latency increasing, up to ~70 requests per second. Then the latency statistics degrade, and at some point the chaincode starts returning error messages. You can see the test results here:

TEST RESULTS
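For reference, the endpoint Yandex.Tank hits is essentially a thin HTTP wrapper around that Composer query. A simplified sketch (Express, with the route name made up and the SQS hop from the real system left out) looks like this:

    'use strict';

    // Simplified sketch of the query endpoint under load. The route name is
    // hypothetical and the AWS SQS hop from the real system is omitted.
    const express = require('express');
    const { queryLedger } = require('./composer-connection'); // sketch above

    const app = express();

    app.get('/participants', async (req, res) => {
      const started = Date.now();
      try {
        const results = await queryLedger();
        res.json({ count: results.length, tookMs: Date.now() - started });
      } catch (err) {
        // Under load, the UNKNOWN / timeout errors from the peer surface here
        res.status(500).json({ error: err.message });
      }
    });

    app.listen(3000, () => console.log('query service listening on :3000'));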

There are two types of error message that I received during the load-test iterations:

1.

[Hyperledger-Composer] undefined:HLFQueryHandler :queryChaincode() query payload returned an error: Error: 2 UNKNOWN: error executing chaincode: failed to execute transaction: timeout expired while executing transaction

2.

HLFQueryHandler :queryChaincode() query payload returned an error: Error: 2 UNKNOWN: error executing chaincode: transaction returned with failure: Error: The current identity, with the name 'txBuilder' and the identifier '5606acbada327a8ef33134e601f990076872b31a3dda5ec0a983e04915d16007', has not been registered

The chaincode container does not restart by itself, but from this point it doesn't work well: sometimes I can ping it, sometimes I can't, and either way the latency is terrible. Only restarting the peer container helps. (I remind you that ledger requests go through a single peer because of Composer; that's not good, but it's not the point of my investigation.) The second error is really strange, because this is the only identity I use, and it works before the chaincode starts to fail, and again after I restart the peer.

While the load is applied, the peer, chaincode and CouchDB containers consume the most CPU (as expected). I'm in the middle of configuring a monitoring system for my blockchain network, and soon I will be able to share more information.

Any thoughts?


UPDATE #1

I've been advised to use c*-type (compute-optimized) AWS instances for deploying Fabric. I chose c5.4xlarge (16 vCPUs) for my tests. I also changed the Fabric config a little bit:

BatchTimeout: 1s
BatchSize:
    MaxMessageCount: 20
    AbsoluteMaxBytes: 98 MB
    PreferredMaxBytes: 512 KB

I performed the same test and, to my regret, I got the same result:

TEST RESULTS

In the figure below you can see a plot of the containers' CPU usage during the test, which lasted 1 minute:

CPU load of fabric001 instance

Peak total CPU usage was ~30%, so we can see that the cause of the latency degradation lies elsewhere.


UPDATE #2

As the performance results were very poor, I decided to continue my tests with pure Fabric, without any unnecessary intermediate components: just the Fabric network and the Node.js SDK. See the new report here.
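For these new tests, the query path is reduced to something like the following fabric-client sketch. The connection profile path, channel, chaincode and user names are placeholders, and it assumes an identity has already been enrolled.

    'use strict';

    // Sketch of a direct chaincode query with the plain Node SDK (fabric-client
    // v1.1), i.e. no Composer and no message broker. All names are placeholders.
    const Client = require('fabric-client');

    async function directQuery() {
      const client = Client.loadFromConfig('./connection-profile.yaml');
      await client.initCredentialStores();
      // Assumes the user was already enrolled and persisted in the credential store
      await client.getUserContext('admin', true);

      const channel = client.getChannel('mychannel');
      const payloads = await channel.queryByChaincode({
        chaincodeId: 'mycc',
        fcn: 'query',
        args: ['key1']
      });
      return payloads.map((p) => p.toString('utf8'));
    }

    directQuery().then(console.log).catch(console.error);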

Upvotes: 8

Views: 3196

Answers (2)

biligunb

Reputation: 41

First of all, how many peers you have will affect the TPS result. It is almost always better to have more peers (but it really depends on the strategy and many other things).

Secondly, the batch size, timeout, and message count all matter too. If you need higher TPS, you might need a bigger batch size and a higher message count (100, for example).

Also, it seems the Java SDK is a little bit faster than the Node SDK, but I have not confirmed that myself. It is possible to go over 1000 TPS, though (this I have confirmed myself).

Upvotes: 0

Ashish Mishra

Reputation: 119

I did a similar test with a similar kind of setup and could achieve about 220 RPS using 8 peer nodes in a single org. With a second org, this performance would drop for sure. I used the high-performance chaincode provided with the fabric-samples. I'm not sure how they managed to get 3500 RPS.

Upvotes: 1
