James Davinport

Reputation: 287

How to transform a txt file into a parquet file and load it into an HDFS table - PySpark

I have a text file that I am trying to convert to a parquet file and then load into a Hive table by writing it to its HDFS path. Everything runs, but the table shows no values.

Here is my code:

#Create my table
spark.sql("create  external table if not exists table1 ( c0 string, c1 string, c2 string)  STORED AS parquet LOCATION 'hdfs://hadoop_data/hive/table1'")

hdfs="hdfs://hadoop_data/hive/table1/output.parquet"

#Read my data file
e=spark.read.text("/home/path/sample_txt_files/sample5.txt")

#Write it to hdfs table as a parquet file
e.write.parquet("hdfs")

Everything runs, but when I check the contents of the table with select * from table1, no values are there.


The content in the sample5.txt file goes like this:

ID,Name,Age
1,James,15

Content inside the .parquet file: (screenshot)

Any ideas or suggestions as to why no data is showing in the table?

Upvotes: 1

Views: 13188

Answers (2)

notNull

Reputation: 31460

Did you try setting these parameters in the Hive shell? You are writing to the hdfs://hadoop_data/hive/table1/output.parquet directory, but the table is created on hdfs://hadoop_data/hive/table1/, so output.parquet is a nested directory under the table location.

SET hive.mapred.supports.subdirectories=TRUE;
SET mapred.input.dir.recursive=TRUE;

Then check whether you are able to see data from the Hive table.
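If you are querying the table through Spark rather than the Hive shell, a minimal sketch of setting the same properties from PySpark might look like this (whether they take effect depends on how Spark resolves the Hive table, so treat it as something to try):

#sketch: set the same properties on the Spark session, then re-check the table
spark.sql("SET hive.mapred.supports.subdirectories=TRUE")
spark.sql("SET mapred.input.dir.recursive=TRUE")
spark.sql("select * from table1").show()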

(or)

Try inserting the data into the table directly using the .insertInto function.

e.write.format("parquet").insertInto("default.table1")
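Note that insertInto resolves columns by position rather than by name, so the DataFrame's column order has to line up with the table definition (c0, c1, c2), and the table has to exist before the write.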

UPDATE:

Since you are reading a text file, Spark loads everything into a single column (value), even though the data has 3 columns.

from pyspark.sql.functions import split, col

#read the text file; returns a DataFrame with a single "value" column
e=spark.read.text("/home/path/sample_txt_files/sample5.txt")

#split the value column on "," and extract the three fields
f=e.withColumn("c0",split(col("value"),",")[0]).withColumn("c1",split(col("value"),",")[1]).withColumn("c2",split(col("value"),",")[2]).drop("value")

f.write.format("parquet").insertInto("default.table1")

If you have a CSV file (or a file with any other delimiter), use spark.read.csv() with options to read the file.
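For the sample5.txt shown in the question (comma-delimited with a header row), a minimal sketch of the CSV approach, reusing the default.table1 table from above, could be:

#sketch: read the comma-delimited file with its header row, keeping every column as string
f=spark.read.csv("/home/path/sample_txt_files/sample5.txt", header=True, inferSchema=False, sep=",")

#insertInto matches columns by position, so ID, Name, Age land in c0, c1, c2
f.write.format("parquet").insertInto("default.table1")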

Upvotes: 2

E.ZY.

Reputation: 725

I would check the underlying Parquet data types against your Hive schema.
That said, id, name, and age are all declared as string in the Hive table, but when you write out the Parquet file, the data types of id and age might be int instead of string.
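A quick way to check is to print the DataFrame schema before writing and, if needed, cast the columns to string so they match the Hive table; a sketch using the column names from the question:

from pyspark.sql.functions import col

#sketch: inspect the types Spark will write, then force them to string
f.printSchema()
f=f.select(col("c0").cast("string"), col("c1").cast("string"), col("c2").cast("string"))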

Upvotes: 0
