Reputation: 287
So, I am trying to load a csv file, and then save it as a parquet file, and then load it into a Hive table. However whenever it load it into the table, the values are out of place and all over the place.I am using Pyspark/Hive
Here is the content in my csv file:
Here is my code to convert csv to parquet and write it to my HDFS location:
#This creates the sparkSession
from pyspark.sql import SparkSession
#from pyspark.sql import SQLContext
spark = (SparkSession \
.builder \
.appName("S_POCC") \
.enableHiveSupport()\
.getOrCreate())
df = spark.read.load('/user/new_file.csv', format="csv", sep=",", inferSchema="true", header="false")
df.write.save('hdfs://my_path/table/test1.parquet')
This succesfully converts it to parquet and to the path however when I load it using the following statements in Hive, it gives a weird output.
Hive statements:
drop table sndbx_test.test99 purge ;
create external table if not exists test99 ( c0 string, c1 string, c2 string, c3 string, c4 string, c5 string, c6 string);
load data inpath 'hdfs://my_path/table/test1.parquet;
Any ideas/suggestions?
Upvotes: 3
Views: 7162
Reputation: 29155
Instead of saving as parquet and then trying to insert in to hive df.write.save('hdfs://my_path/table/test1.parquet')
you can do directly like below...
df.write
.format("parquet")
.partitionBy('yourpartitioncolumns')
.saveAsTable('yourtable')
OR
df.write
.format("parquet")
.partitionBy('yourpartitioncolumns')
.insertInto('yourtable')
Note: if you dont have patition columns and is non-partition table then no need of partitionBy
Upvotes: 6
Reputation: 314
If you describe your table, it would most probably show that your table stores data in ORC format since it is default for Hive. Hence, while creating your table, make sure you mention the format in which the underlying data will be stored, in this case parquet.
Upvotes: -1
Reputation: 1376
Instead of creating a table and then loading the data into it, you can do both in one statement.
CREATE EXTERNAL TABLE IF NOT EXISTS test99 ( c0 string, c1 string, c2 string, c3 string, c4 string, c5 string, c6 string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS PARQUET
LOCATION 'hdfs://my_path/table/' ;
Upvotes: 0