SundarG

Reputation: 107

Spark HDFS Direct Read vs Hive External table read

We have a couple of HDFS directories in which data is stored in delimited format. The directories are created as one directory per ingestion date, and each directory is added as a partition to a Hive external table.

Directory structure:

    /data/table1/INGEST_DATE=20180101
    /data/table1/INGEST_DATE=20180102
    /data/table1/INGEST_DATE=20180103
    etc.

Now we want to process this data in a Spark job. From the program I can either read these HDFS directories directly by giving the exact directory path (Option 1), or read from the Hive table into a DataFrame and process that (Option 2).
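A minimal sketch of both options, assuming a Hive-enabled SparkSession, a pipe delimiter, and a Hive table named db.table1 (all of these names are illustrative):

    import org.apache.spark.sql.SparkSession

    // Hive support is needed for Option 2; names below are illustrative.
    val spark = SparkSession.builder()
      .appName("Table1Processing")
      .enableHiveSupport()
      .getOrCreate()

    // Option 1: read the HDFS directory directly; the schema comes from
    // the files, and basePath lets Spark recover INGEST_DATE as a column.
    val direct = spark.read
      .option("basePath", "/data/table1")
      .option("delimiter", "|") // assumed delimiter
      .csv("/data/table1/INGEST_DATE=20180101")

    // Option 2: read via the Hive external table; the metastore supplies
    // the schema and the list of partitions.
    val viaHive = spark.table("db.table1")
      .where("INGEST_DATE = '20180101'")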

I would like to know if there is any significant difference between following Option 1 or Option 2. Please let me know if you need any other details. Thanks in advance.

Upvotes: 1

Views: 286

Answers (1)

Ged

Reputation: 18128

If you want to select a subset of the columns, that is only possible via spark.sql. In your use case I don't think there will be a significant difference.

With Spark SQL you can get partition pruning automatically.
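For instance, with a predicate on the partition column, Spark asks the metastore which partitions match and only scans those directories. A sketch, with table and column names again illustrative:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().enableHiveSupport().getOrCreate()

    // Only /data/table1/INGEST_DATE=20180102 is actually scanned, and
    // only the named columns are projected.
    val pruned = spark.sql(
      "SELECT col1, col2 FROM db.table1 WHERE INGEST_DATE = '20180102'")

    // The physical plan should show the INGEST_DATE predicate under
    // PartitionFilters rather than as a row-level filter.
    pruned.explain()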

Upvotes: 0
