miguel0afd

Reputation: 315

How to get or create a Hadoop client from a Spark Executor

Is there any way to get the Hadoop FileSystem from a Spark executor when performing a mapPartitions operation over a Spark DataFrame? If not, is there at least a way to get the Hadoop configuration in order to build a new Hadoop FileSystem?

Take into account that the HDFS cluster is kerberized.

The use case would be something like this (pseudo-code):

spark.sql("SELECT * FROM cities").mapPartitions { iter =>
    iter.grouped(someGroupSize).foreach { rows =>
        hadoopFS.write(rows)  // hadoopFS is the handle I need on the executor
    }
    Iterator(TaskContext.getPartitionId())
}

Upvotes: 3

Views: 2222

Answers (1)

miguel0afd

Reputation: 315

I found the solution. Spark's util package contains a very simple way of serializing the Hadoop configuration: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/SerializableConfiguration.scala
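For completeness, here is a minimal sketch of how that approach can be applied to the use case above. Note that SerializableConfiguration is marked private[spark] in the Spark source, so the wrapper below is a local copy of the same pattern rather than a direct import. The output path, the idea of writing one file per partition, and the names used are illustrative assumptions, not part of the original question:

import java.io.{ObjectInputStream, ObjectOutputStream}

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.TaskContext
import org.apache.spark.sql.SparkSession

// Serializable wrapper around the (non-serializable) Hadoop Configuration,
// following the pattern of Spark's SerializableConfiguration, which is
// private[spark], so a local copy like this one may be needed.
class SerializableConfiguration(@transient var value: Configuration) extends Serializable {
  private def writeObject(out: ObjectOutputStream): Unit = {
    out.defaultWriteObject()
    value.write(out)
  }
  private def readObject(in: ObjectInputStream): Unit = {
    in.defaultReadObject()
    value = new Configuration(false)
    value.readFields(in)
  }
}

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

// Broadcast the driver's Hadoop configuration so every executor can rebuild it.
val hadoopConf = spark.sparkContext.broadcast(
  new SerializableConfiguration(spark.sparkContext.hadoopConfiguration))

val partitionIds = spark.sql("SELECT * FROM cities").mapPartitions { iter =>
  // Recreate a FileSystem handle on the executor from the broadcast configuration.
  val fs = FileSystem.get(hadoopConf.value.value)
  val partitionId = TaskContext.getPartitionId()
  // Hypothetical output path, just to show the FileSystem handle in use.
  val out = fs.create(new Path(s"/tmp/cities/part-$partitionId"))
  try iter.foreach(row => out.writeBytes(row.mkString(",") + "\n"))
  finally out.close()
  Iterator(partitionId)
}
partitionIds.collect()

Regarding the Kerberos concern: Spark distributes HDFS delegation tokens to the executors for jobs submitted with valid credentials, so FileSystem.get with the broadcast configuration should normally authenticate on the executor without additional code.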

Upvotes: 5
