Reputation: 71
We have established a working subscription with the IBM Data Replication CDC Replication Engine for Kafka, and messages (replicated transactions) have started to appear on the target Kafka topics.
Our goal is to create a program that reads these messages from Kafka and writes them to a file on the target system.
How do we adjust the kafka_bookmark_storage_type parameter?
We started to follow the instructions provided here. According to the section "Kafka transactionally consistent consumer", there is a prerequisite: we have to
“[..] change the system parameter kafka_bookmark_storage_type from the default value POINTBASE to the value COMMITSTREAMTOPIC. [..]”
Could you please advise where to change the above-mentioned parameter? Our target system runs on Linux; the source runs on AIX. Which leads to our second question:
Transactionally consistent client or WebHDFS: what would we technically lose or gain in terms of functionality if we used CDC for WebHDFS instead of CDC for Kafka?
Upvotes: 1
Views: 646
Reputation: 71
I'm Sarah and I work for IBM. I'll answer your question in two parts:
“Our goal is to create a program that reads these messages from Kafka and writes them to a file on the target system.”
Incorporating the TCC (transactionally consistent consumer) API into your consuming application is a means of ensuring you can recreate the original transactionality of the source data. However, you can also use standard means of consuming from Kafka by simply reading data from the topics. In the Knowledge Center you’ll see, for each KCOP, the kafka-console-consumer command to read data in the generic Kafka way. Just pointing out that you have both options.
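For instance, a quick generic check of a topic with the stock console consumer could look like this (the broker address and topic name are placeholders):

    # Read a CDC target topic the generic Kafka way (placeholders for broker and topic)
    bin/kafka-console-consumer.sh \
      --bootstrap-server broker-host:9092 \
      --topic your.cdc.topic \
      --from-beginning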
And if you would rather write your own consumer that appends the messages to a file:
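Below is a minimal sketch, assuming the standard Apache Kafka Java client rather than the TCC API; the broker address, topic name, and output path are placeholders, and the deserializers depend on which KCOP you configured.

    // Minimal sketch: consume CDC messages from a Kafka topic and append them to a file.
    // Placeholders: broker address, group id, topic name, output path.
    // If your KCOP writes Avro rather than text/JSON, swap in an Avro deserializer.
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    import java.io.BufferedWriter;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;

    public class CdcTopicToFile {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "broker-host:9092");   // placeholder
            props.put("group.id", "cdc-file-writer");             // placeholder
            props.put("key.deserializer",
                      "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer",
                      "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("auto.offset.reset", "earliest");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
                 BufferedWriter out = Files.newBufferedWriter(Paths.get("/tmp/cdc-output.txt"))) {
                consumer.subscribe(Collections.singletonList("your.cdc.topic")); // placeholder topic
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                    for (ConsumerRecord<String, String> record : records) {
                        out.write(record.value());  // one replicated message per line
                        out.newLine();
                    }
                    out.flush();
                }
            }
        }
    }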
“[..] change the system parameter kafka_bookmark_storage_type from the default value POINTBASE to the value COMMITSTREAMTOPIC. [..]”
This is a datastore parameter and should be set on the CDC for Kafka target instance. You can do this via Management Console (MC) by right-clicking the datastore and adding the parameter there.
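If you would rather script it, I believe the dmset command on the target engine can set the same system parameter from the command line; treat the exact invocation below as an assumption to verify for your CDC version (the install path and instance name are placeholders):

    # Assumption: dmset sets CDC system parameters for a given instance
    <cdc_install_dir>/bin/dmset -I <kafka_target_instance> kafka_bookmark_storage_type=COMMITSTREAMTOPIC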
Now the second part of your question:
“Transactionally consistent client vs. WebHDFS: what would we lose or gain in terms of functionality if we used CDC for WebHDFS instead of CDC for Kafka?”
CDC for Kafka is the product's fastest target. The architecture of Kafka closely aligns with the stream of changes that occur on a source database. HDFS, by contrast, requires aggregation of messages, because Hadoop does not cope well with many small files. Transforming an OLTP change stream into batch files is inherently less efficient and makes poorer use of CDC resources. CDC for Kafka also scales better, as it can write to topics in parallel.
Many customers with Hadoop systems found that putting Kafka in front of them as a buffer for OLTP-style message workloads allowed them both to access the data in real time directly from their Kafka cluster and to use Kafka as a buffer for batch aggregation when ultimately writing out to Hadoop. Some customers report success with an open source HDFS connector for Kafka that performs this task: it takes data from Kafka, applies it to Hadoop, and I believe can even convert the data to Parquet or Avro file formats.
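Purely as an illustration (the connector class and property names below refer to the Confluent HDFS sink connector, one such open source option, and should be verified against its documentation), a Kafka Connect sink configuration of that kind typically looks something like:

    # Hypothetical HDFS sink connector configuration; verify property names for your connector version
    name=cdc-hdfs-sink
    connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
    tasks.max=1
    topics=your.cdc.topic
    hdfs.url=hdfs://namenode:8020
    flush.size=10000
    # Write Parquet files; an Avro format class is also available
    format.class=io.confluent.connect.hdfs.parquet.ParquetFormat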
Upvotes: 1