LateCoder
LateCoder

Reputation: 2283

Get source files for directory of parquet tables in spark

I have some code where I read in many parquet tables via a directory and wildcard, like this:

df = sqlContext.read.load("some_dir/*")

Is there some way I can get the source file for each row in the resulting DataFrame, df?

Upvotes: 2

Views: 2272

Answers (1)

eliasah
eliasah

Reputation: 40380

Let's create some dummy data and save it in parquet format.

spark.range(1,1000).write.save("./foo/bar")
spark.range(1,2000).write.save("./foo/bar2")
spark.range(1,3000).write.save("./foo/bar3")

Now we can read the data as desired :

import org.apache.spark.sql.functions.input_file_name

spark.read.load("./foo/*")
     .select(input_file_name(), $"id")
     .show(3,false)
// +---------------------------------------------------------------------------------------+---+
// |INPUT_FILE_NAME()                                                                      |id |
// +---------------------------------------------------------------------------------------+---+
// |file:/home/eliasah/foo/bar/part-r-00002-9554d123-23fc-4524-a900-1cdbd9274cc3.gz.parquet|500|
// |file:/home/eliasah/foo/bar/part-r-00002-9554d123-23fc-4524-a900-1cdbd9274cc3.gz.parquet|501|
// |file:/home/eliasah/foo/bar/part-r-00002-9554d123-23fc-4524-a900-1cdbd9274cc3.gz.parquet|502|
// +---------------------------------------------------------------------------------------+---+

Since Spark 1.6, you can combine parquet data source and input_file_name function as shown above.

This seem to be buggy before spark 2.x with pyspark, but this is how it's done :

from pyspark.sql.functions import input_file_name

spark.read.load("./foo/*") \
     .select(input_file_name(), "id") \
     .show(3,truncate=False)
# +---------------------------------------------------------------------------------------+---+
# |INPUT_FILE_NAME()                                                                      |id |
# +---------------------------------------------------------------------------------------+---+
# |file:/home/eliasah/foo/bar/part-r-00002-9554d123-23fc-4524-a900-1cdbd9274cc3.gz.parquet|500|
# |file:/home/eliasah/foo/bar/part-r-00002-9554d123-23fc-4524-a900-1cdbd9274cc3.gz.parquet|501|
# |file:/home/eliasah/foo/bar/part-r-00002-9554d123-23fc-4524-a900-1cdbd9274cc3.gz.parquet|502|
# +---------------------------------------------------------------------------------------+---+

Upvotes: 3

Related Questions