Reputation: 107
We have couple HDFS directories in which data stored in delimited format. These directories created as one directory per ingestion date. These directories added as a partitions to a Hive external table.
Directory structure:
/data/table1/INGEST_DATE=20180101
/data/table1/INGEST_DATE=20180102
/data/table1/INGEST_DATE=20180103 etc.
Now we want to process this data in spark job. From the program I can directly read these HDFS directories by giving exact directory path(Option 1) or I can read from Hive into a data frame and process(Option 2).
I would like to know if there is any significant difference in following Option1 or Option2. Please let me know if need any other details. Thanks in Advance
Upvotes: 1
Views: 286
Reputation: 18128
If you want to select a subset of the columns, then that it is only possible via spark.sql. In your use case I don't think there will be a significant difference.
With Spark SQL you can get Partition pruning automatically.
Upvotes: 0