James Davinport

Reputation: 287

How to load a parquet file into a Hive Table using Spark?

So, I am trying to load a CSV file, save it as a Parquet file, and then load it into a Hive table. However, whenever I load it into the table, the values are out of place and scattered all over. I am using PySpark/Hive.

Here is the content of my CSV file: [screenshot of the CSV contents]

Here is my code to convert csv to parquet and write it to my HDFS location:

# Create the SparkSession with Hive support enabled
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("S_POCC")
         .enableHiveSupport()
         .getOrCreate())

# Read the CSV; with header="false", Spark assigns default column names _c0.._c6
df = spark.read.load('/user/new_file.csv', format="csv", sep=",", inferSchema="true", header="false")

# save() writes Parquet by default; the path becomes a directory of part files
df.write.save('hdfs://my_path/table/test1.parquet')

This successfully converts it to Parquet and writes it to the path; however, when I load it using the following statements in Hive, it gives a weird output.

Hive statements:

drop table sndbx_test.test99 purge;

create external table if not exists test99 ( c0 string, c1 string, c2 string, c3 string, c4 string, c5 string, c6 string);

load data inpath 'hdfs://my_path/table/test1.parquet' into table test99;

Output: [screenshot of the garbled query output]

Any ideas/suggestions?

Upvotes: 3

Views: 7162

Answers (3)

Ram Ghadiyaram

Reputation: 29155

Instead of saving as Parquet with df.write.save('hdfs://my_path/table/test1.parquet') and then trying to load the result into Hive, you can write directly into a Hive table, like below:

df.write \
    .format("parquet") \
    .partitionBy('yourpartitioncolumns') \
    .saveAsTable('yourtable')

OR

# insertInto requires the table to already exist; it uses the table's
# existing file format and partitioning, so format() and partitionBy()
# must not be specified here
df.write.insertInto('yourtable')

Note: if you don't have partition columns (i.e., the table is not partitioned), the partitionBy call is not needed.
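For reference, here is a minimal end-to-end sketch of the saveAsTable approach, reusing the question's CSV path and database/table name (sndbx_test.test99); adjust the names to your environment. Note that saveAsTable creates the table in the metastore if it does not exist, while insertInto expects an existing table with a matching schema.

from pyspark.sql import SparkSession

# Hive support is required so the table is registered in the Hive metastore
spark = (SparkSession.builder
         .appName("S_POCC")
         .enableHiveSupport()
         .getOrCreate())

df = spark.read.load('/user/new_file.csv', format="csv",
                     sep=",", inferSchema="true", header="false")

# Writes the Parquet files and registers the table in one step,
# so no separate LOAD DATA statement is needed
df.write.format("parquet").saveAsTable("sndbx_test.test99")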

Upvotes: 6

Vihit Shah

Reputation: 314

If you describe your table, it will most probably show that it stores data in ORC format, since that is the default in many Hive distributions. Hence, while creating your table, make sure you specify the format in which the underlying data will be stored, in this case Parquet.
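You can check the current storage format from PySpark with a DESCRIBE FORMATTED query (a sketch; the table name follows the question's example and may differ in your metastore):

# The "SerDe Library" and "InputFormat" rows reveal the storage format;
# a Parquet table shows the Parquet serde rather than the ORC or text one
spark.sql("DESCRIBE FORMATTED sndbx_test.test99").show(50, truncate=False)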

Upvotes: -1

pedram bashiri

Reputation: 1376

Instead of creating a table and then loading the data into it, you can do both in one statement.

CREATE EXTERNAL TABLE IF NOT EXISTS test99 ( c0 string, c1 string, c2 string, c3 string, c4 string, c5 string, c6 string)
STORED AS PARQUET
LOCATION 'hdfs://my_path/table/';
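The same DDL can also be run from PySpark; a sketch, keeping the question's placeholder path. Note that LOCATION must be the directory that actually contains the Parquet part files:

spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS test99 (
        c0 string, c1 string, c2 string, c3 string,
        c4 string, c5 string, c6 string)
    STORED AS PARQUET
    LOCATION 'hdfs://my_path/table/'
""")

# The external table points at the existing data, so it is queryable immediately
spark.sql("SELECT * FROM test99 LIMIT 5").show()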

Upvotes: 0
