Anuj jain
Anuj jain

Reputation: 513

Flink Generate Dynamic Stream from GenericRecord Stream

I have a use case where multiple types of Avro records are coming in single Kafka topic as we are suing TopicRecordNameStrategy for the subject in the schema registry.

Now I have written a consumer to read that topic and build a Datastream of GenericRecord. Now I can not sink this stream to hdfs/s3 in parquet format as this stream contains different types of schema records. So I am filtering different records for each type by applying a filter and creating different streams and then sinking each stream separately.

Below is the code that I am using--- ``

import io.confluent.kafka.schemaregistry.client.CachedSchemaRegistryClient;
import io.confluent.kafka.schemaregistry.client.SchemaMetadata;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.flink.api.common.ExecutionConfig;
import org.apache.flink.api.common.functions.FilterFunction;
import org.apache.flink.api.common.serialization.SimpleStringEncoder;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.api.java.utils.ParameterTool;
import org.apache.flink.core.fs.Path;
import org.apache.flink.formats.parquet.avro.ParquetAvroWriters;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.io.InputStream;
import java.util.List;
import java.util.Properties;

public class EventStreamProcessor {

    private static final Logger LOGGER = LoggerFactory.getLogger(EventStreamProcessor.class);
    private static final String KAFKA_TOPICS = "events";
    private static Properties properties = new Properties();
    private static String schemaRegistryUrl = "";

    private static CachedSchemaRegistryClient registryClient = new CachedSchemaRegistryClient(schemaRegistryUrl, 1000);

    public static void main(String args[]) throws Exception {

        ParameterTool para = ParameterTool.fromArgs(args);
        InputStream inputStreamProperties = EventStreamProcessor.class.getClassLoader().getResourceAsStream(para.get("properties"));
        properties.load(inputStreamProperties);
        int numSlots = para.getInt("numslots", 1);
        int parallelism = para.getInt("parallelism");
        String outputPath = para.get("output");

        final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(parallelism);
        env.getConfig().enableForceAvro();

        env.enableCheckpointing(60000);

        ExecutionConfig executionConfig = env.getConfig();
        executionConfig.disableForceKryo();
        executionConfig.enableForceAvro();

        FlinkKafkaConsumer kafkaConsumer010 = new FlinkKafkaConsumer(KAFKA_TOPICS,
                new KafkaGenericAvroDeserializationSchema(schemaRegistryUrl),
                properties);

        Path path = new Path(outputPath);

        DataStream<GenericRecord> dataStream = env.addSource(kafkaConsumer010).name("bike_flow_source");

        try {
            final StreamingFileSink<GenericRecord> sink = StreamingFileSink.forBulkFormat
                    (path, ParquetAvroWriters.forGenericRecord(SchemaUtils.getSchema("events-com.events.search_list")))
                    .withBucketAssigner(new EventTimeBucketAssigner())
                    .build();

            dataStream.filter((FilterFunction<GenericRecord>) genericRecord -> {
                if (genericRecord.get(Constants.EVENT_NAME).toString().equals("search_list")) {
                    return true;
                }
                return false;
            }).addSink(sink).name("search_list_sink").setParallelism(parallelism);


            final StreamingFileSink<GenericRecord> sink_search_details = StreamingFileSink.forBulkFormat
                    (path, ParquetAvroWriters.forGenericRecord(SchemaUtils.getSchema("events-com.events.search_details")))
                    .withBucketAssigner(new EventTimeBucketAssigner())
                    .build();

            dataStream.filter((FilterFunction<GenericRecord>) genericRecord -> {
                if (genericRecord.get(Constants.EVENT_NAME).toString().equals("search_details")) {
                    return true;
                }
                return false;
            }).addSink(sink_search_details).name("search_details_sink").setParallelism(parallelism);



            final StreamingFileSink<GenericRecord> search_list = StreamingFileSink.forBulkFormat
                    (path, ParquetAvroWriters.forGenericRecord(SchemaUtils.getSchema("events-com.events.search_list")))
                    .withBucketAssigner(new EventTimeBucketAssigner())
                    .build();

            dataStream.filter((FilterFunction<GenericRecord>) genericRecord -> {
                if (genericRecord.get(Constants.EVENT_NAME).toString().equals("search_list")) {
                    return true;
                }
                return false;
            }).addSink(search_list).name("search_list_sink").setParallelism(parallelism);


        } catch (Exception e) {
            LOGGER.info("exception in sinking event");
        }
        env.execute("event_stream_processor");
    }
}
``

So this looks very inefficient to me as

  1. Every time a new event is added I have to do code change.
  2. I have to create multiple streams by the filter and all.

So please suggest to me is it possible to write a GenericRecord stream without creating multiple streams. If not how can I make this code more dynamic using some config file so every time I do not have to write the same code again for a new event?

Please suggest some better way to solve this problem.


I am trying liks this but its not working ....


        for (EventConfig eventConfig : eventTypesList) {
            LOGGER.info("creating a stream for ", eventConfig.getEvent_name());
            String key = eventConfig.getEvent_name();
            final StreamingFileSink<GenericRecord> sink = StreamingFileSink.forBulkFormat
                    (path, ParquetAvroWriters.forGenericRecord(SchemaUtils.getSchema(eventConfig.getSchema_subject())))
                    .withBucketAssigner(new EventTimeBucketAssigner())
                    .build();

            DataStream<GenericRecord> stream = dataStream.filter((FilterFunction<GenericRecord>) genericRecord -> {
                if (genericRecord.get(EVENT_NAME).toString().equals(eventConfig.getEvent_name())) {
                    return true;
                }
                return false;
            });

            Tuple2<DataStream<GenericRecord>, StreamingFileSink<GenericRecord>> tuple2 = new Tuple2<>(stream, sink);
            streamMap.put(key, tuple2);
        }


        DataStream<GenericRecord> searchStream = streamMap.get(SEARCH_LIST_KEYLESS).getField(0);
        searchStream.map(new Enricher()).addSink(streamMap.get(SEARCH_LIST_KEYLESS).getField(1));

Please provide the correct way to achieve this.

Thanks.

Upvotes: 1

Views: 2692

Answers (1)

Dominik Wosiński
Dominik Wosiński

Reputation: 3874

Well You can simply pass the list of possible message types as config parameter and then simply iterate over this. You would have something like this :

messageTypes.foreach( msgType => {
    final StreamingFileSink<GenericRecord> sink = StreamingFileSink.forBulkFormat
                    (path, ParquetAvroWriters.forGenericRecord(SchemaUtils.getSchema(msgType)))
                    .withBucketAssigner(new EventTimeBucketAssigner())
                    .build();

            dataStream.filter((FilterFunction<GenericRecord>) genericRecord -> {
                if (genericRecord.get(Constants.EVENT_NAME).toString().equals(msgType)) {
                    return true;
                }
                return false;
            }).addSink(sink).name(msgType+"_sink").setParallelism(parallelism);

}})

This would mean that You only need to restart the job with changed config when new message type arrives.

Upvotes: 3

Related Questions