Spark HDFS Direct Read vs Hive External table read

Question

We have a couple of HDFS directories in which data is stored in delimited format. One directory is created per ingestion date, and each directory is added as a partition to a Hive external table.

Directory structure:

/data/table1/INGEST_DATE=20180101

/data/table1/INGEST_DATE=20180102

/data/table1/INGEST_DATE=20180103 etc.

Now we want to process this data in a Spark job. From the program, I can either read these HDFS directories directly by giving the exact directory paths (Option 1), or read from the Hive table into a DataFrame and process it (Option 2).
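
For concreteness, the two options might look roughly like this in PySpark. This is only a sketch under stated assumptions: the table name `mydb.table1`, the comma delimiter, the STRING type of the partition column, and the helper names `partition_paths`, `read_direct`, and `read_via_hive` are all hypothetical and not taken from the actual job. The `spark` argument is assumed to be an existing `SparkSession`.

```python
from datetime import date, timedelta

BASE_PATH = "/data/table1"  # base directory from the layout shown above


def partition_paths(start, end, base_path=BASE_PATH):
    """Build one INGEST_DATE=YYYYMMDD directory path per day in [start, end]."""
    paths = []
    for offset in range((end - start).days + 1):
        day = (start + timedelta(days=offset)).strftime("%Y%m%d")
        paths.append(f"{base_path}/INGEST_DATE={day}")
    return paths


def read_direct(spark, start, end, sep=","):
    """Option 1: read the partition directories straight from HDFS."""
    return (
        spark.read
        .option("basePath", BASE_PATH)  # keeps INGEST_DATE as a column
        .option("sep", sep)             # assumed delimiter; set the real one
        .csv(partition_paths(start, end))
    )


def read_via_hive(spark, start, end, table="mydb.table1"):
    """Option 2: read through the Hive external table (hypothetical name)."""
    first, last = start.strftime("%Y%m%d"), end.strftime("%Y%m%d")
    # Assumes INGEST_DATE is declared as a STRING partition column.
    return spark.table(table).where(
        f"INGEST_DATE BETWEEN '{first}' AND '{last}'"
    )


if __name__ == "__main__":
    # Printing the paths needs no cluster; the readers need a SparkSession.
    for path in partition_paths(date(2018, 1, 1), date(2018, 1, 3)):
        print(path)
```

In Option 1 the delimiter and any schema have to be supplied in code, and every listed directory must exist. In Option 2 they come from the Hive table definition, and the session would need to be created with `SparkSession.builder.enableHiveSupport()`.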

I would like to know whether there is any significant difference between Option 1 and Option 2. Please let me know if you need any other details. Thanks in advance.
