Reputation: 67
I'm using Cloudera VM, a linux terminal and spark version 1.6.0
Let's say I have the following dataset:
The columns are priority, qty, sales; I'm not importing headers.
low,6,261.54
high,44,1012
low,1,240
high,25,2500
I can load the file:

    val inputFile = sc.textFile("file:///home/cloudera/stat.txt")

and I can sort:

    inputFile.sortBy(x => x(1), true).collect
but I want to place low and high priority data into 2 separate files.
Would that be a filter, a reduceByKey, or partitioning? How best could I do that? If I can get help with that, I think I can wrap my head around creating RDDs of priority & sales and qty & sales.
Upvotes: 0
Views: 78
Reputation: 1771
It's maybe not the best solution, but you can apply two filters to create two different RDDs: one keeps only the "low" lines, the other only the "high" lines. Then save each to HDFS (or, with a `file://` path, to the local filesystem).
Note that `inputFile` is an `RDD[String]`, so `$"Priority"` (DataFrame column syntax) won't work here; split each line on the comma and test the first field instead:

    inputFile.filter(_.split(",")(0) == "low").saveAsTextFile("low_file")
    inputFile.filter(_.split(",")(0) == "high").saveAsTextFile("high_file")
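For the follow-up about building RDDs of priority & sales and qty & sales, here is a rough sketch along the same lines. It assumes the comma-separated layout shown in the question (priority, qty, sales, no header) and the `inputFile` RDD loaded above; the variable names are just illustrative:

    // Split each line once and reuse the resulting fields
    val parsed = inputFile.map(_.split(","))   // Array(priority, qty, sales)

    // Pair RDD of (priority, sales)
    val prioritySales = parsed.map(f => (f(0), f(2).toDouble))

    // Pair RDD of (qty, sales)
    val qtySales = parsed.map(f => (f(1).toInt, f(2).toDouble))

    // With a pair RDD you can then aggregate, e.g. total sales per priority:
    prioritySales.reduceByKey(_ + _).collect()

Keying the RDD this way is also what makes `reduceByKey` applicable, which you mentioned in the question.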
Upvotes: 0