Reputation: 57
I have a stream of XML records which I process in Scala using hadoopRDD and finally save to a single file. However, I need to sort those XML records by certain attributes before saving them to the output file.
I thought of creating a List that pairs the sort value with each XML record. The records look like this:
<Transaction>
<eventId>1234</eventId>
<eventName>hello</eventName>
.......
</Transaction>
<Transaction>
<eventId>2345</eventId>
<eventName>hi</eventName>
.......
</Transaction>
--- and so on
My idea is to create a list such as {(1234, xml1), (2345, xml2), ...}, sort on the first element, and save the second element to the output file.
How can this be done in Scala, or is there a better approach? Thanks in advance for your suggestions and help.
Upvotes: 0
Views: 166
Reputation: 57
I was able to figure it out as follows. First, I created a function that extracts the eventId from an XML record, and used it to pair each record with its eventId:
val rdd = input.map {x => (geteventId(x) , x)}
Then I sorted by eventId, kept only the XML, and saved the result to HDFS:
val result = rdd.sortBy(x => x._1).map(x => x._2)
geteventId(x) parses the XML record and returns the value of its eventId element.
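The body of geteventId is not shown above, so here is a minimal sketch of what such a helper might look like, together with the same pair, sort, and unpair steps. It is only an illustration under stated assumptions: the object name SortXmlSketch, the regex-based extraction, and the choice to parse the id as a Long are all hypothetical, and it uses a plain Seq instead of an RDD so that it runs without Spark. On an RDD the same map and sortBy calls apply.

```scala
object SortXmlSketch {
  // Hypothetical stand-in for geteventId: reads the digits inside <eventId>...</eventId>.
  // A regex keeps the sketch dependency-free; a real XML parser would be more robust.
  private val EventId = """<eventId>\s*(\d+)\s*</eventId>""".r

  // Assumption: eventId is numeric. Parsing to Long gives numeric ordering,
  // whereas sorting the raw strings would put "10" before "9".
  def getEventId(xml: String): Long =
    EventId.findFirstMatchIn(xml)
      .map(_.group(1).toLong)
      .getOrElse(sys.error(s"no eventId in record: $xml"))

  // Pair each record with its key, sort on the key, then keep only the record.
  def sortByEventId(records: Seq[String]): Seq[String] =
    records.map(x => (getEventId(x), x)).sortBy(_._1).map(_._2)

  def main(args: Array[String]): Unit = {
    val input = Seq(
      "<Transaction><eventId>2345</eventId><eventName>hi</eventName></Transaction>",
      "<Transaction><eventId>1234</eventId><eventName>hello</eventName></Transaction>"
    )
    // Prints the records in ascending eventId order.
    sortByEventId(input).foreach(println)
  }
}
```

If the ids were left as strings, the sort would be lexicographic, so converting to a numeric type before sorting is worth considering when the ids can differ in length.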
Upvotes: 1