Reputation: 23
How does Spark process XML files in a distributed manner? An XML file is not splittable, right? Will it be processed by only a single node? I'm a little confused, and it would be helpful if someone could help me with this question. Thanks in advance.
Upvotes: 0
Views: 160
Reputation: 21
I ran into the same question in a recent Spark use case. From what I observed in the Spark Web UI, an XML file does indeed seem not to be splittable, but the transformations (read, parse, etc.) seem to be handled by multiple nodes in a distributed manner. In summary: assuming you have 100 XML files to read and process and 10 nodes, each handling one file at a time, you can process only 10 files at once before moving on to the next batch of 10 (10 -> 20 -> 30 ... 100).
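To make the 10 -> 20 -> 30 progression concrete, here is a rough sketch in plain Python (no Spark needed) that models the behavior I described. It assumes one task per file and one task slot per node, which is my simplification and not something Spark guarantees; in a real cluster the number of concurrent tasks depends on the total executor cores. The names `schedule_waves`, `num_files` and `num_slots` are made up for illustration.

```python
import math


def schedule_waves(num_files, num_slots):
    """Group file indices into waves of at most num_slots files each.

    Assumption: each file is one unsplittable task and each slot
    runs exactly one task at a time.
    """
    files = list(range(num_files))
    return [files[i:i + num_slots] for i in range(0, num_files, num_slots)]


# 100 files on 10 slots, as in the example above
waves = schedule_waves(100, 10)

# Running total of files finished after each wave
completed = []
total = 0
for wave in waves:
    total += len(wave)
    completed.append(total)

print(len(waves))      # number of waves needed
print(completed[:3])   # files finished after the first three waves
```

With a file count that is not a multiple of the slot count, the last wave is simply smaller, so the number of waves is `math.ceil(num_files / num_slots)`.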
Upvotes: 1