bioinfornatics

Reputation: 1808

PySpark how to get the partition name on query results?

I would like to retrieve the partition name in the query results.

So if I have a dataset partitioned like:

dataset/foo/
        ├─ key=value1
        ├─ key=value2
        └─ key=value3

I can run this query:

results = session.read.parquet('dataset/foo/key=value[12]') \
                      .select(['BAR']) \
                      .where('BAZ < 10')

Once I do this, how do I know the partition origin of each result?

Indeed, I only get values from the BAR column.

Thanks for your help

Upvotes: 2

Views: 1393

Answers (1)

notNull

Reputation: 31460

Include the key column in your select statement!

#read the foo/ directory itself; it is partitioned, so we can filter on the partition column `key`
from pyspark.sql.functions import col

results = session.read.parquet('foo/') \
                      .filter((col("key") == "value1") & (col("BAZ") < 10)) \
                      .select(['BAR', 'key'])

If you want to add the origin filename to every record, use input_file_name():

from pyspark.sql.functions import col, input_file_name

results = session.read.parquet('foo/') \
                      .withColumn("input_file", input_file_name()) \
                      .filter((col("key") == "value1") & (col("BAZ") < 10)) \
                      .select(['BAR', 'key', 'input_file'])

Upvotes: 2
