Philip K. Adetiloye

Reputation: 3268

Selecting fields from a Spark RDD

I've got a fairly big RDD with 400 fields coming from a Kafka Spark stream. I need to create another RDD (or Map) by selecting some fields from the initial RDD when I transform the stream, and eventually write the result to Elasticsearch.

I know my fields by name, but I don't know their indexes.

How do I project the specific fields by field name to a new Map?

Upvotes: 1

Views: 513

Answers (1)

Jayant

Reputation: 346

  1. Assuming each field is delimited by '#', you can determine the index of each field from the first row (or a header file) and store the name-to-index mapping in a data structure. You can then use this mapping to look up fields and build the new maps.

  2. You can use the Apache Avro format to pre-process the data. That lets you access fields by name and removes the need to know their indexes within the string. The following link is a good starting point for integrating Avro with Kafka and Spark.

http://aseigneurin.github.io/2016/03/04/kafka-spark-avro-producing-and-consuming-avro-messages.html
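Option 1 can be sketched as plain functions that you would apply inside a `map` transformation. This is a minimal sketch, assuming '#'-delimited records and a known header line; the field names (`user_id`, `score`, etc.) are made up for illustration.

```python
# Sketch of option 1: build a name->index map from the header once,
# then use it to project named fields out of each delimited record.
# Delimiter and field names are assumptions, not from the question.

def build_index_map(header_line, delimiter="#"):
    """Map each field name to its position, derived from the header row."""
    return {name: i for i, name in enumerate(header_line.split(delimiter))}

def project_fields(record_line, index_map, wanted, delimiter="#"):
    """Select only the wanted fields from one delimited record."""
    parts = record_line.split(delimiter)
    return {name: parts[index_map[name]] for name in wanted}

# Hypothetical header and record:
header = "user_id#country#age#score"
index_map = build_index_map(header)

record = "42#NG#31#88"
projected = project_fields(record, index_map, ["user_id", "score"])
# projected == {"user_id": "42", "score": "88"}
```

In Spark you would compute `index_map` once on the driver (e.g. from the first line or a header file) and then apply the projection per record, roughly `rdd.map(lambda line: project_fields(line, index_map, ["user_id", "score"]))`; for a large cluster, broadcasting the map is a reasonable refinement.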

Upvotes: 1
