Reputation: 611
This is my first time trying to convert a txt file to Parquet format, so please bear with me.
I have a txt file that originally looks like this:
id|roads|weights
a01|1026|1172|1
a02|DT:SR:0|2|NE|DT:SR:1|2|NE|1
b01|DT:SR:1|7|SW|1
And I'd like to convert it to Parquet so the resulting table looks like this:
+---+-------------------------+-------+
|id |roads                    |weights|
+---+-------------------------+-------+
|a01|1026|1172                |1      |
|a02|DT:SR:0|2|NE|DT:SR:1|2|NE|1      |
|b01|DT:SR:1|7|SW             |1      |
+---+-------------------------+-------+
So far, I have uploaded my txt file to HDFS and tried to use Spark to convert it to Parquet with:
val textfile = spark.read.text("hdfs:some/path/file.txt")
textfile.write.parquet("some.parquet")
val parquetfile = spark.read.parquet("hdfs:some/path/some.parquet")
But my column names are now treated as a data row, and everything has been lumped into a single column called "value".
Any help would be appreciated!
Upvotes: 0
Views: 410
Reputation: 340
read.text loads the text file and returns a DataFrame with a single column named "value". You can use read.csv to read the delimited file instead; with the header option enabled, the first line is used for the column names rather than being treated as data. The following piece of code should work for you:
val textFile = spark.read.option("delimiter", "|").option("header", true).csv("hdfs:some/path/file.txt")
textFile.write.parquet("hdfs:some/path/some.parquet")
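If you want to double-check the result, you can read the Parquet file back and inspect it (a quick sketch, assuming the output path used above):

val parquetFile = spark.read.parquet("hdfs:some/path/some.parquet")
parquetFile.printSchema() // should list id, roads and weights as separate string columns
parquetFile.show(false)   // show(false) prints rows without truncating the roads values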
Upvotes: 1