Reputation: 315
Is there a way to get the Hadoop FileSystem from a Spark executor while performing a mapPartitions operation over a Spark DataFrame? If not, is there at least a way to get the Hadoop configuration on the executor so that a new Hadoop FileSystem can be created there?
Take into account that the HDFS cluster is kerberized.
The use case would be something like this (pseudo-code):
spark.sql("SELECT * FROM cities").mapPartitions { iter =>
  iter.groupBy(someKey).foreach { rows =>
    hadoopFS.write(rows)
  }
  TaskContext.getPartitionId
}
Upvotes: 3
Views: 2222
Reputation: 315
I found the solution. Spark's utils package contains a very simple way of serializing the Hadoop configuration: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/SerializableConfiguration.scala
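For reference, here is a usage sketch. Depending on the Spark version, org.apache.spark.util.SerializableConfiguration may be private[spark], so the sketch copies the same serialization pattern into its own small wrapper; the class name SerializableHadoopConf, the cities table and the output path are illustrative, not part of any API. The idea is to broadcast the driver's Hadoop configuration and rebuild the FileSystem inside mapPartitions. On a kerberized cluster this still relies on the delegation tokens Spark ships to the executors (for example when submitting with a valid ticket or with --principal/--keytab).

import java.io.{ObjectInputStream, ObjectOutputStream}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.TaskContext
import org.apache.spark.sql.SparkSession

// Same pattern as Spark's SerializableConfiguration: Configuration is a
// Hadoop Writable, so it is (de)serialized with write/readFields.
class SerializableHadoopConf(@transient var value: Configuration) extends Serializable {
  private def writeObject(out: ObjectOutputStream): Unit = {
    out.defaultWriteObject()
    value.write(out)
  }
  private def readObject(in: ObjectInputStream): Unit = {
    in.defaultReadObject()
    value = new Configuration(false)
    value.readFields(in)
  }
}

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

// Broadcast the driver's Hadoop configuration so every executor can use it.
val hadoopConf = spark.sparkContext.broadcast(
  new SerializableHadoopConf(spark.sparkContext.hadoopConfiguration))

spark.sql("SELECT * FROM cities").mapPartitions { iter =>
  // Rebuild the FileSystem on the executor from the broadcast configuration.
  val fs = FileSystem.get(hadoopConf.value.value)
  val partitionId = TaskContext.getPartitionId()
  val out = fs.create(new Path(s"/tmp/cities/part-$partitionId"))   // illustrative output path
  iter.foreach(row => out.write((row.mkString(",") + "\n").getBytes("UTF-8")))
  out.close()
  Iterator(partitionId)
}.collect()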
Upvotes: 5