Reputation: 1808
I would like to retrieve the partition name in my query results.
So if I have a partitioned dataset like:
dataset/foo/
├─ key=value1
├─ key=value2
└─ key=value3
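(For reference, a layout like this is typically what a partitioned write produces, e.g.:
df.write.partitionBy('key').parquet('dataset/foo/')
)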
I can run this query:
results = session.read.parquet('dataset/foo/key=value[12]') \
.select(['BAR']) \
.where('BAZ < 10')
Once I do this, how do I know the partition of origin for each result?
Indeed, I only get values from the BAR column.
Thanks for your help
Upvotes: 2
Views: 1393
Reputation: 31460
Include the key column in your select statement!
from pyspark.sql.functions import col

# read the whole foo directory; as it is partitioned, we can filter on the key column
results = session.read.parquet('dataset/foo/') \
    .filter((col("key") == "value1") & (col("BAZ") < 10)) \
    .select(['BAR', 'key'])
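Because the read starts at the base directory of the partitioned dataset, Spark discovers key as a regular column, and filtering on it also prunes the non-matching directories. To match several partitions at once, as the key=value[12] glob did, a sketch like this should work:
results = session.read.parquet('dataset/foo/') \
    .filter(col("key").isin("value1", "value2") & (col("BAZ") < 10)) \
    .select(['BAR', 'key'])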
If you also want to add the originating file name to each record, use input_file_name():
from pyspark.sql.functions import col, input_file_name

results = session.read.parquet('dataset/foo/') \
    .filter((col("key") == "value1") & (col("BAZ") < 10)) \
    .withColumn("input_file", input_file_name()) \
    .select(['BAR', 'key', 'input_file'])
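If you would rather keep the key=value[12] glob from the question, Spark can still infer the key column when you point it at the dataset root via the basePath read option; a sketch:
results = session.read.option("basePath", "dataset/foo/") \
    .parquet('dataset/foo/key=value[12]') \
    .filter(col("BAZ") < 10) \
    .select(['BAR', 'key'])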
Upvotes: 2