Reputation: 433
Assume I am given the following partitioned Parquet data:
.
└── data/
    ├── product=soda/
    │   ├── <hash>_toto.parquet
    │   ├── ...
    │   └── <hash>.parquet
    └── product=cake/
        ├── <hash>.parquet
        └── ...
I would like to read the data using PySpark while excluding a given list of Parquet files, which includes <hash>_toto.parquet.
I can read the whole partitioned dataset, but I have no idea how to exclude some of the files. I would also like to keep the feature where Spark merges the data and creates a product column, as it does with the following code:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("local") \
    .appName("app") \
    .config("spark.executor.memory", "5gb") \
    .config("spark.cores.max", "6") \
    .getOrCreate()

# this reads all parquet files, merges them, and creates a `product`
# column from the partition directories (spark.read replaces the
# deprecated SQLContext)
dataframe = spark.read.parquet("./data")
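For reference, this is the merging behavior I want to keep; a rough sketch of what the read produces (the data columns are whatever the files contain, so they are elided here):

dataframe.printSchema()
# root
#  |-- ... (columns from the parquet files) ...
#  |-- product: string (nullable = true)

dataframe.select("product").distinct().show()
# +-------+
# |product|
# +-------+
# |   soda|
# |   cake|
# +-------+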
Upvotes: 1
Views: 351
Reputation: 645
You could add a boolean column toto to your dataframe and write it out with partitionBy('toto'). You then get subfolders toto=true/ and toto=false/. After reading the complete folder, you can simply run .where('toto == false'); since toto is a partition column, partition pruning prevents Spark from reading those Parquet files at all.
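A minimal sketch of that idea (the output path ./data_flagged and the assumption that df already carries the toto flag are mine):

# assume `df` already has a boolean column `toto` marking rows to exclude
df.write.partitionBy("toto").parquet("./data_flagged")

# partition pruning skips the toto=true/ directories entirely
clean = spark.read.parquet("./data_flagged").where("toto == false")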
Upvotes: 0
Reputation: 32640
Use the input_file_name function to filter out rows coming from files whose name matches a given pattern (here the '_toto' suffix from your question):

from pyspark.sql import functions as F

df = spark.read.parquet("./data")
df = df.filter(~F.input_file_name().rlike('_toto'))
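Note that, unlike filtering on a partition column, this does not prune files at the source: every file is still listed and read, and the unwanted rows are dropped afterwards. If you want to double-check which files survived the filter, you can materialize the file name as a column (source_file is just an illustrative name):

# materialize the originating file name per row for inspection
df = df.withColumn("source_file", F.input_file_name())
df.select("source_file").distinct().show(truncate=False)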
Upvotes: 1