James Davinport

Reputation: 287

How to transform a txt file into a parquet file and load it into an HDFS table - PySpark

I have a text file that I am trying to convert to a parquet file and then load into a Hive table by writing it to its HDFS path. Everything runs, but the table shows no values.

Here is my code:

#Create my table
spark.sql("create  external table if not exists table1 ( c0 string, c1 string, c2 string)  STORED AS parquet LOCATION 'hdfs://hadoop_data/hive/table1'")

hdfs="hdfs://hadoop_data/hive/table1/output.parquet"

#Read my data file
e=spark.read.text("/home/path/sample_txt_files/sample5.txt")

#Write it to hdfs table as a parquet file
e.write.parquet("hdfs")

Everything runs, but when I check the contents of the table with select * from table1, no values are there.


The content in the sample5.txt file goes like this:

ID,Name,Age
1,James,15

Content inside the .parquet file: (screenshot)

Any ideas or suggestions as to why no data is showing in the table?

Upvotes: 1

Views: 13188

Answers (2)

notNull

Reputation: 31460

Did you try setting these parameters in the Hive shell? You are writing to the hdfs://hadoop_data/hive/table1/output.parquet directory, but the table is created on hdfs://hadoop_data/hive/table1/, so output.parquet is a nested directory under the table location.

SET hive.mapred.supports.subdirectories=TRUE;
SET mapred.input.dir.recursive=TRUE;

Then check whether you are able to see data from the Hive table.
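If you are querying the table through Spark rather than the Hive shell, a minimal sketch of setting the same properties from PySpark might look like this (whether they take effect depends on how Spark resolves the Hive table, so treat it as something to try):

#sketch: set the same properties on the Spark session, then re-check the table
spark.sql("SET hive.mapred.supports.subdirectories=TRUE")
spark.sql("SET mapred.input.dir.recursive=TRUE")
spark.sql("select * from table1").show()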

(or)

Try inserting the data into the table directly using the .insertInto function.

e.write.format("parquet").insertInto("default.table1")
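Note that insertInto resolves columns by position rather than by name, so the DataFrame's column order has to line up with the table definition (c0, c1, c2), and the table has to exist before the write.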

UPDATE:

Since you are reading a text file, Spark loads everything into a single column (value), even though the data has 3 columns.

from pyspark.sql.functions import split, col

#read the text file; returns a DataFrame with a single "value" column
e=spark.read.text("/home/path/sample_txt_files/sample5.txt")

#split the value column on "," and extract the three fields
f=e.withColumn("c0",split(col("value"),",")[0]).withColumn("c1",split(col("value"),",")[1]).withColumn("c2",split(col("value"),",")[2]).drop("value")

f.write.format("parquet").insertInto("default.table1")

If you have a CSV file (or a file with any other delimiter), use spark.read.csv() with options to read the file.
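For the sample5.txt shown in the question (comma-delimited with a header row), a minimal sketch of the CSV approach, reusing the default.table1 table from above, could be:

#sketch: read the comma-delimited file with its header row, keeping every column as string
f=spark.read.csv("/home/path/sample_txt_files/sample5.txt", header=True, inferSchema=False, sep=",")

#insertInto matches columns by position, so ID, Name, Age land in c0, c1, c2
f.write.format("parquet").insertInto("default.table1")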

Upvotes: 2

E.ZY.

Reputation: 725

I would check the underlying Parquet data types against your Hive schema.
That said, id, name, and age are all declared as string in the Hive table, but when you write out the Parquet file, the data types of id and age might be int instead of string.
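A quick way to check is to print the DataFrame schema before writing and, if needed, cast the columns to string so they match the Hive table; a sketch using the column names from the question:

from pyspark.sql.functions import col

#sketch: inspect the types Spark will write, then force them to string
f.printSchema()
f=f.select(col("c0").cast("string"), col("c1").cast("string"), col("c2").cast("string"))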

Upvotes: 0
