Reputation: 315
Is there a way to get the Hadoop FileSystem from a Spark executor while performing a mapPartitions operation over a Spark DataFrame? If not, is there at least a way to get the Hadoop configuration on the executor so that a new Hadoop FileSystem can be created there?
Take into account that the HDFS cluster is kerberized.
The use case would be something like this (pseudo-code):
spark.sql("SELECT * FROM cities").mapPartitions { iter =>
  iter.groupBy(someKey).foreach { rows =>
    hadoopFS.write(rows)
  }
  TaskContext.getPartitionId
}
Upvotes: 3
Views: 2222
Reputation: 315
I found the solution. Spark's utils package contains a very simple way of serializing the Hadoop configuration: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/SerializableConfiguration.scala
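For reference, here is a usage sketch. Depending on the Spark version, org.apache.spark.util.SerializableConfiguration may be private[spark], so the sketch copies the same serialization pattern into its own small wrapper; the class name SerializableHadoopConf, the cities table and the output path are illustrative, not part of any API. The idea is to broadcast the driver's Hadoop configuration and rebuild the FileSystem inside mapPartitions. On a kerberized cluster this still relies on the delegation tokens Spark ships to the executors (for example when submitting with a valid ticket or with --principal/--keytab).

import java.io.{ObjectInputStream, ObjectOutputStream}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.TaskContext
import org.apache.spark.sql.SparkSession

// Same pattern as Spark's SerializableConfiguration: Configuration is a
// Hadoop Writable, so it is (de)serialized with write/readFields.
class SerializableHadoopConf(@transient var value: Configuration) extends Serializable {
  private def writeObject(out: ObjectOutputStream): Unit = {
    out.defaultWriteObject()
    value.write(out)
  }
  private def readObject(in: ObjectInputStream): Unit = {
    in.defaultReadObject()
    value = new Configuration(false)
    value.readFields(in)
  }
}

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

// Broadcast the driver's Hadoop configuration so every executor can use it.
val hadoopConf = spark.sparkContext.broadcast(
  new SerializableHadoopConf(spark.sparkContext.hadoopConfiguration))

spark.sql("SELECT * FROM cities").mapPartitions { iter =>
  // Rebuild the FileSystem on the executor from the broadcast configuration.
  val fs = FileSystem.get(hadoopConf.value.value)
  val partitionId = TaskContext.getPartitionId()
  val out = fs.create(new Path(s"/tmp/cities/part-$partitionId"))   // illustrative output path
  iter.foreach(row => out.write((row.mkString(",") + "\n").getBytes("UTF-8")))
  out.close()
  Iterator(partitionId)
}.collect()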
Upvotes: 5