Bheem Singh

Reputation: 21

Best approach to load data into Hive using Scala for multiple single-record files

I have multiple files from a source system, where each file contains the details of a single record in the following format, e.g.:

Cust ID: 123,Cust name: XYZ, Some notes to be ignored, E-mail: [email protected]

I have 10,000 such files. I am reading each of these files individually in a loop in my Scala code, extracting the field names and values from each file and storing them in a Spark DataFrame. I then apply the necessary transformations to the DataFrame and store it in a Hive table. The problem is that this approach takes too long to load the data into Hive, probably because each file is opened and processed individually. Is there any other way of speeding up the process?
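Schematically, the per-file loop looks something like this (the paths, field names and target table below are placeholders):

import scala.io.Source
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().enableHiveSupport().getOrCreate()
import spark.implicits._

// Placeholder list; in reality there are 10,000 such file paths
val files = Seq("/data/input/rec_00001.txt", "/data/input/rec_00002.txt")

val records = files.map { path =>
  // Each file is opened and parsed individually, which is the slow part
  val src = Source.fromFile(path)
  val line = try src.getLines().mkString finally src.close()
  // Turn "Field name: value" chunks into a map, dropping chunks without a colon
  val fields = line.split(",").map(_.split(":", 2)).collect {
    case Array(k, v) => k.trim -> v.trim
  }.toMap
  (fields.getOrElse("Cust ID", ""), fields.getOrElse("Cust name", ""), fields.getOrElse("E-mail", ""))
}

val df = records.toDF("cust_id", "cust_name", "email")
df.write.mode("overwrite").saveAsTable("database.customer")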

Upvotes: 0

Views: 1047

Answers (2)

Rishu S

Reputation: 3968

You can try storing all of the data files in a single folder location and building an external Hive table on top of these files. Assuming the files are CSV and all 10,000 of them have the same structure, an external Hive table would let you query them directly.

In your external table, you can ignore any header row using the table properties, as below:

CREATE EXTERNAL TABLE <tableName>(
  col1 string,
  col2 int
  ....
)
ROW FORMAT DELIMITED 
FIELDS TERMINATED BY ','
LOCATION '<hdfs location of your files>'
tblproperties ("skip.header.line.count"="1")

At this stage you will have a table containing all of your data (unformatted). You can later create another table or views on top of this external table to format the data as required. Hive has many built-in functions to achieve the desired output.
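For example, if each comma-separated chunk of the raw record lands in its own string column, a view along these lines could pull the values out of the "Field: value" pairs (a sketch only; the view name, column names and positions are illustrative):

CREATE VIEW customer_view AS
SELECT
  CAST(trim(split(col1, ':')[1]) AS INT) AS cust_id,    -- "Cust ID: 123"   -> 123
  trim(split(col2, ':')[1])              AS cust_name,  -- "Cust name: XYZ" -> XYZ
  trim(split(col4, ':')[1])              AS email       -- "E-mail: ..."
FROM <tableName>;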

Hope this helps :)

Note: This solution doesn't use spark.

Upvotes: 0

Shantanu Sharma

Reputation: 4089

Try this:

  1. Read all the required files into DataFrames.
  2. Merge/union the individual DataFrames into a single DataFrame.
  3. Repartition the combined DataFrame.
  4. Store the repartitioned DataFrame in a Hive table.

Read all the files into DataFrames:

val dfs = Seq("file1", "file2").map { path =>
  spark.read.format("csv").option("header", "true").load(path)
}

Reduce the DataFrames into one:

val unionDF = dfs.reduceLeft((dfa, dfb) =>
  dfa.union(dfb)
)

Reduce the number of partitions of the combined DataFrame (coalesce avoids a full shuffle):

val repartitionedDF = unionDF.coalesce(10)

Write to the Hive table:

repartitionedDF.write.mode("overwrite").saveAsTable("database.table")
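Putting the four steps together, a minimal end-to-end sketch could look like the following (assuming a SparkSession with Hive support; the paths, partition count and table name are illustrative):

import org.apache.spark.sql.{DataFrame, SparkSession}

val spark = SparkSession.builder()
  .appName("load-single-record-files")
  .enableHiveSupport()
  .getOrCreate()

// Illustrative list of input files; in practice this would hold all 10,000 paths
val paths = Seq("/data/input/file1.csv", "/data/input/file2.csv")

// 1. Read each file into its own DataFrame
val dfs: Seq[DataFrame] = paths.map { p =>
  spark.read.format("csv").option("header", "true").load(p)
}

// 2. Union the individual DataFrames into one
val unionDF = dfs.reduceLeft(_ union _)

// 3. Reduce the number of output partitions to avoid thousands of tiny files
val repartitionedDF = unionDF.coalesce(10)

// 4. Persist the result to a Hive table
repartitionedDF.write.mode("overwrite").saveAsTable("database.table")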

Let me know how it goes.

Upvotes: 1
