Reputation: 1000
I have a dataframe that is created by a daily batch which runs for a specific day, and it is then saved in HDFS (Azure Data Lake Gen2).
It is saved using something like this:
TradesWritePath="abfss://refined@"+datalakename+".dfs.core.windows.net/curated/clientscoredata/clientscoredatainput/Trades/" + TradeDatedt + "/Trades_" + tradedateInt + ".parquet"
tradesdf.write.mode("overwrite").parquet(TradesWritePath)
As you see, I do not partition the dataframe as it contains only one date.
So, as an example, the first file on the first day would be stored in the folder
Trades/2019/08/25
and then the next day, it would be in the folder
Trades/2019/08/26
The question is: once all the data has been written, will a filter predicate on dates still be pushed down, i.e. will Spark know which folder to read instead of doing a full scan?
Or do I still have to write using the partitionBy option, even though I am saving only one day at a time, just so Spark understands this while reading and goes straight to the right folder instead of a full scan?
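For clarity, this is the kind of write I mean by the partitionBy option (the TradeDate column name is just illustrative; the rest comes from my job above):

TradesRootPath = "abfss://refined@" + datalakename + ".dfs.core.windows.net/curated/clientscoredata/clientscoredatainput/Trades"
# Spark creates the TradeDate=... subfolders itself instead of me building the date path by hand
tradesdf.write.mode("overwrite").partitionBy("TradeDate").parquet(TradesRootPath)
# note: a plain overwrite replaces the whole Trades folder; setting
# spark.sql.sources.partitionOverwriteMode=dynamic makes it replace only the day being written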
Next part of the question is:
When I look at the folder where the parquet file is stored, I see a lot of small "part-****.snappy.parquet" files. From other questions on this forum I understand that I would have to use the repartition option to bring the number of files down. But is that necessary? I have read that too many small files affect performance, so one option could be to save the data in chunks of roughly 128 MB (since I am not sure yet how the data will be used downstream, i.e. which columns), so that the NameNode in HDFS is not overburdened and we also do not end up with files that are too large. Does this apply to these "snappy parquet part files" as well?
There are a lot of concepts here and I am still trying to get my head around the best way to store these dataframes as Parquet, so any insight would be hugely appreciated.
Upvotes: 0
Views: 297
Reputation: 6159
You should not create your own directories.
Use partitionBy to partition by date when you write your parquet. Spark will handle the directory creation automatically, and it will not scan the full table when reading.
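A minimal sketch, assuming the dataframe has a date column (the trade_date column name and the path are placeholders):

# Spark creates one trade_date=<value> directory per date automatically
tradesdf.write.mode("append").partitionBy("trade_date").parquet("abfss://refined@<datalake>.dfs.core.windows.net/curated/Trades")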
For the second part of your question: yes, the idea is to keep each partition around 128 MB, but to be honest that won't cost you much; you can keep the default of 200 shuffle partitions.
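If you want to check or change that, it is an ordinary Spark config (200 is the built-in default):

spark.conf.get("spark.sql.shuffle.partitions")          # '200' unless you changed it
spark.conf.set("spark.sql.shuffle.partitions", "100")   # only lower it if your daily volume is small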
Upvotes: 1
Reputation: 6994
Spark will only know where to pick up the data if it is stored as
root/
    date=ddmmyy/
    date=dd1mm1yy1/
    ...
The = sign is important. You cannot use an arbitrary directory structure for predicate pushdown; it has to be in the format shown above.
In your example, you would need to store it as something like

root/
    Trades/
        date=2019-08-25/
        date=2019-08-26/
Spark leverages Hive's partition discovery mechanism to detect the partitions of a table, and Hive partitioning requires the data to be laid out in exactly this manner.
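A quick way to verify the pruning (a sketch, assuming the partition column is named date):

df = spark.read.parquet("root/")             # the "date" column is discovered from the directory names
df.filter(df.date == "2019-08-25").explain()
# the physical plan lists this filter under PartitionFilters, meaning only the
# date=2019-08-25 directory is listed and scanned, not the whole table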
Coming to the next part of your question: keeping lots of small files in HDFS is very bad regardless of the file type, and yes, that holds for these snappy parquet part files too. You should use repartition or coalesce to keep the file sizes close to 128 MB.
The NameNode's responsibility is to track all the files in HDFS, and the default HDFS block size is 128 MB. Therefore, try to keep each .parquet file close to 128 MB, but not more; if a file is larger, HDFS will use two blocks to represent it.
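A rough sketch of how to aim for that size (the per-day size estimate is an assumption; replace it with a measurement of your own data):

import math

target_file_bytes = 128 * 1024 * 1024                 # one HDFS block
estimated_day_bytes = 2 * 1024 * 1024 * 1024          # assumed ~2 GB of parquet per day
num_files = max(1, math.ceil(estimated_day_bytes / target_file_bytes))

# coalesce merges partitions without a shuffle; use repartition if the data is heavily skewed
tradesdf.coalesce(num_files).write.mode("overwrite").parquet(TradesWritePath)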
Upvotes: 1