Akash Kumar

Reputation: 642

Apache Kafka disaster recovery plan

We have 10 application servers and 3 Kafka clusters supporting the application's messaging requests. Recently, a network issue took a Kafka cluster down, which brought the whole application down for a few hours because all the data was lost. While looking for a Kafka disaster recovery plan, I found that we should have -

  1. Failover to another cluster in the same data center
  2. Failover to another cluster in a nearby data center
  3. Failover to another cluster in a data center in another zone

Since we have constraints that prevent us from having another data center, we were thinking of this approach -

  1. All application servers write data to a file
  2. Filebeat reads the file and pushes it to Kafka

In case of an issue at the Kafka end, the data will still be available in the files and can be recovered. So, my question is: is this approach good? Are there any significant issues with this architecture? Any other suggestions?
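To make the proposal concrete, here is a minimal sketch of the application-server side, assuming one JSON message per line in a file that Filebeat tails (the file path and class name are hypothetical):

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;

public class FileFallbackWriter {
    // Hypothetical path; Filebeat would be configured to tail this file.
    private static final Path LOG_FILE = Paths.get("/var/log/app/messages.ndjson");

    // Append one message per line so Filebeat can ship each line as a Kafka record.
    public static synchronized void append(String jsonMessage) throws IOException {
        Files.write(LOG_FILE,
                (jsonMessage + System.lineSeparator()).getBytes(StandardCharsets.UTF_8),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }
}
```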

Upvotes: 3

Views: 10346

Answers (3)

devrimbaris

Reputation: 796

Did you check MirrorMaker 2 (a feature that ships with Kafka 2.4+)? It enables one-way and two-way replication scenarios with two or more clusters. It even translates consumer group offsets to the other Kafka cluster in case you take over from the other side.
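The offset translation is also exposed programmatically via the mirror client library. A rough sketch, assuming cluster aliases `primary` and `backup` and a consumer group `my-group` (all names and addresses here are placeholders):

```java
import java.time.Duration;
import java.util.HashMap;
import java.util.Map;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.connect.mirror.RemoteClusterUtils;

public class OffsetTranslationExample {
    public static void main(String[] args) throws Exception {
        // Connection properties for the backup cluster we are failing over to.
        Map<String, Object> props = new HashMap<>();
        props.put("bootstrap.servers", "backup-broker:9092"); // placeholder address

        // Translate the group's committed offsets from the "primary" cluster
        // into equivalent offsets on the backup cluster.
        Map<TopicPartition, OffsetAndMetadata> translated =
                RemoteClusterUtils.translateOffsets(props, "primary", "my-group",
                        Duration.ofSeconds(30));

        // A consumer on the backup cluster can then seek() to these offsets.
        translated.forEach((tp, offset) ->
                System.out.println(tp + " -> " + offset.offset()));
    }
}
```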

Upvotes: 0

AbhishekN

Reputation: 368

Although I have not faced such a single-DC redundancy scenario myself, I can see it could be interesting for some customers, so treat this as a hypothetical solution.

In my opinion, it would be a bad idea to use non-Kafka infrastructure as your backup solution. Your programmers will cry while coding, since the client APIs depend on a lot of Kafka-related metadata to receive the appropriate messages from topics and partitions. How will the application find the last record it processed from Topic-1:Partition-27? Where will future records go, since producers also use the Kafka APIs?

I would build a secondary Kafka cluster, smaller than your main cluster, with isolated brokers, ZooKeeper, and disks. Use MirrorMaker or Confluent Replicator (https://docs.confluent.io/current/multi-dc-replicator/mirrormaker.html) to fill this cluster with live data. You can keep the retention time lower to manage disk space, but it will keep all of your real-time applications running smoothly.

Once your main cluster goes down, applications switch to the brokers of this secondary cluster and continue regular processing.

Consumer apps will need to save offsets outside of Kafka to be able to simply restart from the last checkpoint. Producer apps just need to change the broker list. This switch can be handled by a proxy or by an independent microservice maintaining the Kafka connections, if you want to go to that level.
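A rough sketch of that checkpointing pattern, assuming a hypothetical `CheckpointStore` you provide (e.g. backed by a database table):

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class CheckpointingConsumer {
    public static void run(CheckpointStore store, String brokers) {
        Properties props = new Properties();
        props.put("bootstrap.servers", brokers);          // primary or backup cluster
        props.put("group.id", "my-group");                // placeholder group
        props.put("enable.auto.commit", "false");         // we manage offsets ourselves
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            TopicPartition tp = new TopicPartition("Topic-1", 27);
            consumer.assign(List.of(tp));
            consumer.seek(tp, store.lastOffset(tp) + 1);  // resume after last checkpoint

            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> rec : records) {
                    process(rec);
                    store.save(tp, rec.offset());         // checkpoint outside Kafka
                }
            }
        }
    }

    static void process(ConsumerRecord<String, String> rec) { /* application logic */ }

    // Hypothetical external store; in practice this could be a database table.
    interface CheckpointStore {
        long lastOffset(TopicPartition tp);
        void save(TopicPartition tp, long offset);
    }
}
```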

Upvotes: 1

JR ibkr

Reputation: 919

Were your Kafka brokers running on separate rack servers?

It is expected that a rack server might be offline for a few minutes for maintenance purposes: https://kafka.apache.org/documentation/#basic_ops_racks

It is not recommended to distribute a Kafka cluster across different data centers; you may run into network-related problems when you do so.

https://kafka.apache.org/documentation/#datacenters

What if the entire data center is not available?

Sue the data service provider if they did not deliver on their SLA. Write your producer assuming that brokers might not be available. You can also look into unclean leader election.

An alternative strategy: as soon as your producer notices that a Kafka broker is not responding, it puts the data into Elasticsearch or some other database, so that you have something to fall back on.
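A minimal sketch of that fallback, using a producer send callback and a hypothetical `FallbackStore` interface with Elasticsearch or a database behind it:

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class FallbackProducer {
    // Hypothetical sink, e.g. Elasticsearch or another database.
    interface FallbackStore { void save(String key, String value); }

    public static void send(KafkaProducer<String, String> producer,
                            FallbackStore fallback, String topic,
                            String key, String value) {
        producer.send(new ProducerRecord<>(topic, key, value), (metadata, exception) -> {
            if (exception != null) {
                // Broker did not acknowledge the write: keep the data somewhere safe.
                fallback.save(key, value);
            }
        });
    }
}
```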

If you have designed your Kafka environment properly, then min.insync.replicas together with acks=all should guarantee that the data exists on a machine even if a few brokers go down. By design, when the number of in-sync replicas falls below min.insync.replicas, the broker will not accept a message from a producer.
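For illustration, the producer side of that guarantee could be configured like this (broker addresses are placeholders; min.insync.replicas itself is a topic/broker setting, not a producer one):

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;

public class DurableProducerConfig {
    public static KafkaProducer<String, String> create() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092,broker2:9092");
        props.put(ProducerConfig.ACKS_CONFIG, "all");     // wait for all in-sync replicas
        props.put(ProducerConfig.RETRIES_CONFIG, Integer.MAX_VALUE);
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
        // With the topic created as, e.g., replication.factor=3 and
        // min.insync.replicas=2, writes are rejected once too few replicas are in sync.
        return new KafkaProducer<>(props);
    }
}
```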

Also, if data is mirrored across different clusters in different data centers, that would give you even more confidence.

Upvotes: 1
