Reputation: 457
I am trying to generate streaming data to simulate a situation where I receive two Integer values, each with a timestamp, arriving at different time intervals, with Kafka as the connector.
I am using a Flink environment as the consumer, but I don't know what the best solution for the producer is. (Java syntax preferred over Scala, if possible.)
Should I produce the data directly into Kafka? If yes, what is the best way to do it? Or is it better to produce the data from Flink as a producer, send it to Kafka, and consume it with Flink again at the end? How can I do that from Flink? Or perhaps there is another easy way to generate stream data and pass it to Kafka.
If so, please put me on track to achieve it.
Upvotes: 2
Views: 2274
Reputation: 316
For the described use case I would recommend the datagen CLI open-source project.
It is a very simple tool for generating realistic fake stream data for Kafka. It uses the Faker.js API, so you can create custom schemas to simulate various use cases. It also supports JSON, Avro, and SQL schemas, so you can easily produce data in your desired format and even establish relationships between datasets. This lets you generate meaningful data streams for testing, development, or demonstration purposes in a Kafka environment.
For the use case you've described, you can generate 2,000 users with the following schema:
[
  {
    "_meta": {
      "topic": "users",
      "key": "id"
    },
    "id": "faker.datatype.number({min: 1, max: 2000})",
    "name": "faker.name.fullName()",
    "email": "faker.internet.email()",
    "registered_at": "faker.date.past(5, '2023-01-01').getTime()"
  }
]
And produce the data with the following command (note that --dry-run only prints the records to the console; drop it to actually send them to Kafka):
datagen -s users.json -n 2000 --dry-run
The tool also lets you generate relational data that you can query in your downstream apps.
Upvotes: 0
Reputation: 368
As David also mentioned, you can create a dummy producer in plain Java using the KafkaProducer API to schedule and send messages to Kafka as you wish. Similarly, you can do that with Flink if you want multiple simultaneous producers; with Flink you will need to write separate jobs for the producer and the consumer. Kafka enables an asynchronous processing architecture rather than a queue mechanism, so it is better to keep the producer and consumer jobs separate.
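A minimal sketch of such a dummy producer, assuming a local broker at localhost:9092 and a topic named "numbers" (both placeholders, not from the question):

import java.util.Properties;
import java.util.Random;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.IntegerSerializer;
import org.apache.kafka.common.serialization.StringSerializer;

public class DummyProducer {
    public static void main(String[] args) throws InterruptedException {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", IntegerSerializer.class.getName());

        Random random = new Random();
        try (KafkaProducer<String, Integer> producer = new KafkaProducer<>(props)) {
            for (int i = 0; i < 100; i++) {
                // Two logical sources, each emitting an Integer value; a non-null
                // key also lets you exercise topic partitioning later.
                String key = "source-" + (i % 2);
                Integer value = random.nextInt(1000);
                // Explicit record timestamp; Kafka would otherwise stamp it itself.
                producer.send(new ProducerRecord<>("numbers", null,
                        System.currentTimeMillis(), key, value));
                Thread.sleep(random.nextInt(500)); // irregular gaps between events
            }
        }
    }
}

The random sleep of up to 500 ms simulates the "different time range" from the question; swap in a fixed schedule if you need deterministic gaps.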
But think a little more about the intention of this test:

1. Are you trying to test Kafka's streaming durability, replication, and offset-management capabilities? In this case you need simultaneous producers for the same topic, with a null or non-null key in the message.
2. Or are you trying to test Flink-Kafka connector capabilities? In this case you need only one producer; one internal scenario could be a back-pressure test, making the producer push more messages than the consumer can handle (see the sketch after this list).
3. Or are you trying to test topic partitioning and Flink streaming parallelism? In this case you can use single or multiple producers, but the message key should be non-null; you can then test how Flink executors connect to individual partitions and observe their behavior.

There are more ideas you may want to test, and each of them will need something specific to be done (or not done) in the producer.
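For the back-pressure scenario, a Flink producer job can generate the load itself. A minimal sketch, assuming Flink 1.15+ with the flink-connector-kafka dependency and the same placeholder broker and topic as above:

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.connector.base.DeliveryGuarantee;
import org.apache.flink.connector.kafka.sink.KafkaRecordSerializationSchema;
import org.apache.flink.connector.kafka.sink.KafkaSink;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ProducerJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        KafkaSink<String> sink = KafkaSink.<String>builder()
                .setBootstrapServers("localhost:9092") // placeholder broker address
                .setRecordSerializer(KafkaRecordSerializationSchema.builder()
                        .setTopic("numbers")           // placeholder topic name
                        .setValueSerializationSchema(new SimpleStringSchema())
                        .build())
                .setDeliveryGuarantee(DeliveryGuarantee.AT_LEAST_ONCE)
                .build();

        // Emit a million values as fast as possible; a slower consumer job
        // then falls behind, which you can observe as growing consumer lag
        // on the topic.
        env.fromSequence(0, 1_000_000)
           .map(n -> String.valueOf(n))
           .returns(Types.STRING)
           .sinkTo(sink);

        env.execute("Kafka producer job");
    }
}

The consumer side stays a separate job, as recommended above, so each side can be scaled and restarted independently.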
You can check out https://github.com/abhisheknegi/twitStream for pulling tweets using Java APIs, in case that is needed.
Upvotes: 1