Reputation: 611
This is my first time trying to convert a txt file to Parquet format, so please bear with me.
I have a txt file that originally looks like this:
id|roads|weights
a01|1026|1172|1
a02|DT:SR:0|2|NE|DT:SR:1|2|NE|1
b01|DT:SR:1|7|SW|1
And I'd like to convert it to Parquet so the resulting table looks like this:
+---+-------------------------+-------+
|id |roads                    |weights|
+---+-------------------------+-------+
|a01|1026|1172                |1      |
|a02|DT:SR:0|2|NE|DT:SR:1|2|NE|1      |
|b01|DT:SR:1|7|SW             |1      |
+---+-------------------------+-------+
So far, I have uploaded my txt file to HDFS and tried to use Spark to convert it to Parquet with:
val textfile = spark.read.text("hdfs:some/path/file.txt")
textfile.write.parquet("some.parquet")
val parquetfile = spark.read.parquet("hdfs:some/path/some.parquet")
But my column names are now treated as a data row, and everything has been lumped into a single column called "value".
Any help would be appreciated!
Upvotes: 0
Views: 410
Reputation: 340
read.text loads the text file and returns a DataFrame with a single column named "value". You can use read.csv to read the delimited file instead; with the header option enabled, the first line is used for the column names rather than being treated as data. The following piece of code should work for you:
val textFile = spark.read.option("delimiter", "|").option("header", true).csv("hdfs:some/path/file.txt")
textFile.write.parquet("hdfs:some/path/some.parquet")
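If you want to double-check the result, you can read the Parquet file back and inspect it (a quick sketch, assuming the output path used above):

val parquetFile = spark.read.parquet("hdfs:some/path/some.parquet")
parquetFile.printSchema() // should list id, roads and weights as separate string columns
parquetFile.show(false)   // show(false) prints rows without truncating the roads values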
Upvotes: 1