Reputation: 307
I am reading an ASCII text file in Spark (Scala) that contains data in the following format:
name|type|type_ver|id1|yyyy-mm-dd hh:mm:ss
name|type|type_ver|id2|yyyy-mm-dd hh:mm:ss
name|type|type_ver|id3|yyyy-mm-dd hh:mm:ss
name|type|type_ver||yyyy-mm-dd hh:mm:ss
I need to extract the type, type_ver, id and timestamp columns and then sort the extracted entries by timestamp in descending order (latest timestamp at the top).
This is the function I'm using:
def parseTable(line: String): (String, String, String, String) = {
  val fields = line.split("\\|")
  val `type` = fields(1) // `type` is a reserved word in Scala, so it needs backticks
  val type_ver = fields(2)
  val id = fields(3)
  val timeStamp = fields(4)
  (`type`, type_ver, id, timeStamp)
}
def main(args: Array[String]): Unit = {
  Logger.getLogger("org").setLevel(Level.ERROR)
  val conf = new SparkConf()
  val sc = new SparkContext(conf)
  val sqlContext = new SQLContext(sc)
  val spark = sqlContext.sparkSession
  import spark.implicits._

  var file = spark.read.textFile("/<path to file>/<filename>").repartition(10)
  val parseTables = file.map(parseTable).toDF
  val pattern = "yyyy-MM-dd HH:mm:ss"
  val newDF = parseTables.orderBy(unix_timestamp(parseTables(x => x._4), pattern).cast("timeStamp"))
  val newTable = parseTables.coalesce(1)
  newTable.saveAsTextFile("/<path to save file>/Test_1")
}
The newDF approach was suggested in another Stack Overflow answer, but I haven't been able to get it to work in my code. The saveAsTextFile call also stopped working because of it.
How do I sort the data by timestamp in descending order and save the output as a text file?
Upvotes: 0
Views: 372
Reputation: 42392
You can use the Spark CSV reader to read the file into a DataFrame without parsing it manually, and then write the result out with the Spark CSV writer:
import org.apache.spark.sql.functions.desc

// _c4 is the default name Spark gives the fifth (timestamp) column when the file has no header
val df = spark.read.option("delimiter", "|").csv("filename")
val df2 = df.orderBy(desc("_c4"))
df2.coalesce(1).write.option("delimiter", "|").csv("newfilename")
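If you only want the type, type_ver, id and timestamp columns in the output, and want the ordering done on an actual timestamp rather than on the string value, a minimal sketch along the same lines could look like this (assuming Spark 2.2+ for to_timestamp, and the default _c0.._c4 column names Spark assigns when the file has no header):
import org.apache.spark.sql.functions.{col, desc, to_timestamp}

val df = spark.read.option("delimiter", "|").csv("filename")

// Keep only the requested columns, parse the last one as a real timestamp,
// and sort with the latest timestamp first
val result = df
  .select(
    col("_c1").as("type"),
    col("_c2").as("type_ver"),
    col("_c3").as("id"),
    to_timestamp(col("_c4"), "yyyy-MM-dd HH:mm:ss").as("timestamp")
  )
  .orderBy(desc("timestamp"))

result.coalesce(1).write.option("delimiter", "|").csv("newfilename")
Sorting on the raw string also works here because the yyyy-MM-dd HH:mm:ss format sorts lexicographically in the same order as the timestamps; the cast just makes the intent explicit and turns malformed values into nulls you can filter out.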
Upvotes: 1