Mistapopo

Reputation: 433

How to read a sub-sample of partitioned parquets using pySpark?

Assume I am given the following partitioned Parquet data:

.
└── data/
    ├── product=soda/
    │   ├── <hash>_toto.parquet
    │   ├── ...
    │   └── <hash>.parquet
    └── product=cake/
        ├── <hash>.parquet
        └── ...

I would like to read the data using PySpark, but excluding a given list of Parquet files, which contains <hash>_toto.parquet.

I can read the whole partitioned dataset, but I have no idea how to exclude some of the files. I would like to keep the feature Spark implements to merge the data and create a product column, as in the following code:

from pyspark.sql import SparkSession, SQLContext

spark = SparkSession.builder \
                    .master("local") \
                    .appName("app") \
                    .config("spark.executor.memory", "5gb") \
                    .config("spark.cores.max", "6") \
                    .getOrCreate()
sc = spark.sparkContext
sqlContext = SQLContext(sc)  # deprecated; spark.read is the modern equivalent
# this reads all the Parquet files, merges them, and creates a `product` column
dataframe = sqlContext.read.parquet("./data")
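
For context, if the files to keep can be listed explicitly, Spark can still derive the product column when given the basePath option as the root for partition discovery. A minimal sketch, assuming the file names are known up front (the names below are made up):

# Read an explicit subset of files while keeping the `product` column.
# The basePath option tells Spark where partition discovery should start.
files_to_read = [
    "./data/product=soda/part-00000.parquet",
    "./data/product=cake/part-00000.parquet",
]
subset = spark.read.option("basePath", "./data").parquet(*files_to_read)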

Upvotes: 1

Views: 351

Answers (2)

Til Piffl

Reputation: 645

You could add a boolean column toto to your dataframe and partition by it with partitionBy('toto'). Then you get subfolders toto=true/ and toto=false/.

After reading the complete folder you can then simply add `.where('toto == false')`, and partition pruning prevents Spark from reading those files.
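
A minimal sketch of this approach, assuming the data is rewritten once; the data_repartitioned path and the _toto filename test are assumptions for illustration:

from pyspark.sql import functions as F

# Flag the unwanted files and rewrite with `toto` as a partition column.
df = spark.read.parquet("./data")
df = df.withColumn("toto", F.input_file_name().contains("_toto"))
df.write.partitionBy("product", "toto").parquet("./data_repartitioned")

# On subsequent reads, partition pruning skips the toto=true folders entirely.
clean = spark.read.parquet("./data_repartitioned").where("toto = false")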

Upvotes: 0

blackbishop

Reputation: 32640

Use the input_file_name function to filter out rows coming from files whose filename contains 'hash':

from pyspark.sql import functions as F

df = sqlContext.read.parquet("./data")

# keep only rows whose source file name does not match the pattern
# (replace 'hash' with whatever identifies the files to exclude, e.g. '_toto')
df = df.filter(~F.input_file_name().rlike('hash'))
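
Note that input_file_name is evaluated per row, not per file, so this drops the unwanted rows after the files are opened; unlike filtering on a partition column, it does not prevent Spark from listing and scanning those files.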

Upvotes: 1
