Reputation: 817
I want to convert XML files to Avro. The data will arrive in XML format and will hit a Kafka topic first. Then I can use either Flume or Spark Streaming to ingest it, convert it from XML to Avro, and land the files in HDFS. I have a Cloudera environment.
Once the Avro files land in HDFS, I want to be able to read them into Hive tables later.
What is the best way to do this? I have tried automated schema conversion with spark-avro (without Spark Streaming). spark-avro converts the data, but Hive cannot read the result: the XML is converted to a DataFrame and then from the DataFrame to Avro, and the resulting Avro file can only be read by my Spark application. I am not sure if I am using it correctly.
I think I will need to define an explicit Avro schema, but I am not sure how to go about this for the XML file: it uses multiple namespaces and is quite large.
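To make the explicit-schema idea concrete, here is a minimal sketch of what I have in mind, using only the Python standard library. Everything in it is an assumption for illustration: the sample XML, the namespace URIs, the record name `Order`, and the field names are hypothetical and far smaller than my real file. It hand-writes an Avro schema as JSON and maps namespaced XML elements onto its fields. It does not write Avro files itself.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical sample document; the real file is much larger and has more namespaces.
SAMPLE_XML = """
<ord:order xmlns:ord="http://example.com/order"
           xmlns:cus="http://example.com/customer">
  <ord:id>42</ord:id>
  <cus:name>Alice</cus:name>
  <ord:amount>19.99</ord:amount>
</ord:order>
"""

# Prefix-to-URI map used for lookups (hypothetical URIs matching the sample).
NS = {
    "ord": "http://example.com/order",
    "cus": "http://example.com/customer",
}

# Hand-written Avro schema (hypothetical record and field names).
# Avro names cannot contain ':' so XML namespace prefixes are dropped
# and fields are given flat, explicit names.
ORDER_SCHEMA = {
    "type": "record",
    "name": "Order",
    "namespace": "com.example.avro",
    "fields": [
        {"name": "id", "type": "long"},
        # Optional field: a union with "null" first, so the default can be null.
        {"name": "customer_name", "type": ["null", "string"], "default": None},
        {"name": "amount", "type": "double"},
    ],
}


def xml_to_record(xml_text):
    """Map one namespaced XML document onto a dict shaped like ORDER_SCHEMA."""
    root = ET.fromstring(xml_text.strip())
    name = root.find("cus:name", NS)
    return {
        "id": int(root.findtext("ord:id", namespaces=NS)),
        "customer_name": name.text if name is not None else None,
        "amount": float(root.findtext("ord:amount", namespaces=NS)),
    }


def schema_as_avsc(schema):
    """Serialize the schema dict to the JSON text that would go in an .avsc file."""
    return json.dumps(schema, indent=2)


if __name__ == "__main__":
    print(schema_as_avsc(ORDER_SCHEMA))
    print(xml_to_record(SAMPLE_XML))
```

My thinking is that the same `.avsc` text could be shared between the writer and the Hive table definition (for example through the `avro.schema.url` table property), so that both sides agree on the schema instead of relying on whatever spark-avro infers. I do not know whether this approach scales to a schema as large as mine.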
Upvotes: 0
Views: 1407