Reputation: 824
I have a Kafka spout, two bolts to process data, and two bolts to store the processed data in MongoDB.
I am using Apache Storm Flux to create the topology, and the spout reads its data from Kafka. Everything runs fine, but every time I run the topology it processes all the messages in Kafka from the start, and once it has processed all of them it does not wait for more messages and crashes.
How can I make the Storm topology process only the latest messages?
Here is my topology .yaml file:
name: "kafka-topology"
components:
# MongoDB mapper
- id: "block-mapper"
className: "org.apache.storm.mongodb.common.mapper.SimpleMongoMapper"
configMethods:
- name: "withFields"
args: # The following are the tuple fields to map to a MongoDB document
- ["block"]
# MongoDB mapper
- id: "transaction-mapper"
className: "org.apache.storm.mongodb.common.mapper.SimpleMongoMapper"
configMethods:
- name: "withFields"
args: # The following are the tuple fields to map to a MongoDB document
- ["transaction"]
- id: "stringScheme"
className: "org.apache.storm.kafka.StringScheme"
- id: "stringMultiScheme"
className: "org.apache.storm.spout.SchemeAsMultiScheme"
constructorArgs:
- ref: "stringScheme"
- id: "zkHosts"
className: "org.apache.storm.kafka.ZkHosts"
constructorArgs:
- "172.25.33.191:2181"
- id: "spoutConfig"
className: "org.apache.storm.kafka.SpoutConfig"
constructorArgs:
# brokerHosts
- ref: "zkHosts"
# topic
- "blockdata"
# zkRoot
- ""
# id
- "myId"
properties:
- name: "scheme"
ref: "stringMultiScheme"
- name: "ignoreZkOffsets"
value: flase
config:
  topology.workers: 1
  # ...
# spout definitions
spouts:
  - id: "kafka-spout"
    className: "org.apache.storm.kafka.KafkaSpout"
    constructorArgs:
      - ref: "spoutConfig"
    parallelism: 1
# bolt definitions
bolts:
  - id: "blockprocessing-bolt"
    className: "org.apache.storm.flux.wrappers.bolts.FluxShellBolt"
    constructorArgs:
      # command line
      - ["python", "process-bolt.py"]
      # output fields
      - ["block"]
    parallelism: 1
  # ...
  - id: "transprocessing-bolt"
    className: "org.apache.storm.flux.wrappers.bolts.FluxShellBolt"
    constructorArgs:
      # command line
      - ["python", "trans-bolt.py"]
      # output fields
      - ["transaction"]
    parallelism: 1
  # ...
  - id: "mongoBlock-bolt"
    className: "org.apache.storm.mongodb.bolt.MongoInsertBolt"
    constructorArgs:
      - "mongodb://172.25.33.205:27017/testdb"
      - "block"
      - ref: "block-mapper"
    parallelism: 1
  # ...
  - id: "mongoTrans-bolt"
    className: "org.apache.storm.mongodb.bolt.MongoInsertBolt"
    constructorArgs:
      - "mongodb://172.25.33.205:27017/testdb"
      - "transaction"
      - ref: "transaction-mapper"
    parallelism: 1
  # ...
  - id: "log"
    className: "org.apache.storm.flux.wrappers.bolts.LogInfoBolt"
    parallelism: 1
  # ...
# stream definitions
# stream definitions define connections between spouts and bolts.
# note that such connections can be cyclical
# custom stream groupings are also supported
streams:
  - name: "kafka --> block-Processing" # name isn't used (placeholder for logging, UI, etc.)
    from: "kafka-spout"
    to: "blockprocessing-bolt"
    grouping:
      type: SHUFFLE
  - name: "kafka --> transaction-processing" # name isn't used (placeholder for logging, UI, etc.)
    from: "kafka-spout"
    to: "transprocessing-bolt"
    grouping:
      type: SHUFFLE
  - name: "block --> mongo"
    from: "blockprocessing-bolt"
    to: "mongoBlock-bolt"
    grouping:
      type: SHUFFLE
  - name: "transaction --> mongo"
    from: "transprocessing-bolt"
    to: "mongoTrans-bolt"
    grouping:
      type: SHUFFLE
I have tried adding a property to spoutConfig to fetch only the latest messages, like this:
- id: "spoutConfig"
className: "org.apache.storm.kafka.SpoutConfig"
constructorArgs:
- ref: "zkHosts"
- "blockdata"
- ""
- "myId"
properties:
- name: "scheme"
ref: "stringMultiScheme"
- name: "startOffsetTime"
ref: "EarliestTime"
- name: "forceFromStart"
value: false
But it gives an error no matter what I place in the ref of startOffsetTime:
Exception in thread "main" java.lang.IllegalArgumentException: Can not set long field org.apache.storm.kafka.KafkaConfig.startOffsetTime to null value
Upvotes: 0
Views: 557
Reputation: 3651
You need to set the startOffsetTime to kafka.api.OffsetRequest.LatestTime. As you can see at https://github.com/apache/storm/tree/64af629a19a82591dbf3428f7fd6b02f39e0723f/external/storm-kafka#kafkaconfig, the default setting will go to the earliest offset available.
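In your Flux yaml that could look roughly like the sketch below (untested, with the component ids taken from your file). startOffsetTime is just a long field, and kafka.api.OffsetRequest.LatestTime() resolves to -1 (EarliestTime() to -2), so you can pass the number directly instead of a ref. ignoreZkOffsets is only needed if you also want the spout to skip over offsets it has already committed to Zookeeper:
- id: "spoutConfig"
  className: "org.apache.storm.kafka.SpoutConfig"
  constructorArgs:
    - ref: "zkHosts"
    - "blockdata"
    - ""
    - "myId"
  properties:
    - name: "scheme"
      ref: "stringMultiScheme"
    # kafka.api.OffsetRequest.LatestTime() is -1 (EarliestTime() is -2)
    - name: "startOffsetTime"
      value: -1
    # optional: always start from startOffsetTime, even if offsets are stored in Zookeeper
    - name: "ignoreZkOffsets"
      value: true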
The exception you're hitting seems unrelated. It looks like a Curator/Zookeeper incompatibility.
Edit: I think you're hitting this issue: https://issues.apache.org/jira/browse/STORM-2978. Storm 1.2.2 should be out soon; please try upgrading once it releases.
Edit edit: If you want to work around it without upgrading, edit the pom for your topology so it includes a dependency on Zookeeper 3.4 and not 3.5.
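For example, something like this in your topology pom (the version shown is just an example; any recent 3.4.x release should do, and a direct dependency takes precedence over the transitive 3.5 one in Maven's dependency mediation):
<dependency>
  <groupId>org.apache.zookeeper</groupId>
  <artifactId>zookeeper</artifactId>
  <!-- example version; pick whichever 3.4.x release matches your environment -->
  <version>3.4.10</version>
</dependency>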
Upvotes: 1