Reputation: 297
I am working with Spark. My objective is to read lines from a file and sort them based on a hash. I understand that the file is loaded as an RDD of lines. Is there a way to iterate over this RDD so that I can read it line by line? In other words, I want to convert it to an Iterator.
Or am I limited to applying a transformation function to it, following Spark's lazy execution model?
So far I have tried the following transformation code:
SparkConf sparkConf = new SparkConf().setAppName("Sorting1");
JavaSparkContext ctx = new JavaSparkContext(sparkConf);
JavaRDD<String> lines = ctx.textFile("hdfs://localhost:9000/hash-example-output/part-r-00000", 1);
lines = lines.filter(new Function<String, Boolean>() {
    @Override
    public Boolean call(String s) {
        String[] str = COMMA.split(s);
        unsortedArray1[i] = Long.parseLong(str[str.length - 1]);
        i++;
        return s.contains("error");
    }
});
lines.count();
ctx.stop();
sort(unsortedArray1);
Upvotes: 1
Views: 7204
Reputation: 1449
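Try collect(), which returns the contents of the RDD to the driver as a List (so the data has to fit in the driver's memory):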
List<String> list = lines.collect();
Collections.sort(list);
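Note that Collections.sort(list) orders the lines as plain strings. If the goal is to order them by the hash instead, here is a minimal sketch in plain Java. It assumes, as the question's code suggests, that each line is comma-separated and that the last field is a numeric hash. The class and method names (HashLineSorter, hashOf, sortByHash) are made up for illustration, and the hard-coded list stands in for the result of lines.collect().

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class HashLineSorter {
    // Assumption: lines are comma-separated and the last field is the numeric hash.
    public static long hashOf(String line) {
        String[] fields = line.split(",");
        return Long.parseLong(fields[fields.length - 1].trim());
    }

    // Returns a new list ordered by ascending hash; the input list is not modified.
    public static List<String> sortByHash(List<String> collected) {
        List<String> sorted = new ArrayList<>(collected);
        sorted.sort(Comparator.comparingLong(HashLineSorter::hashOf));
        return sorted;
    }

    public static void main(String[] args) {
        // Stand-in for: List<String> collected = lines.collect();
        List<String> collected = Arrays.asList("b,30", "a,10", "c,20");
        for (String line : sortByHash(collected)) {
            System.out.println(line);
        }
    }
}
```

The list is copied before sorting so the sketch does not depend on whether the list returned by collect() can be modified in place.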
Upvotes: 1
Reputation: 441
If you want to sort the strings in an RDD, you could use the takeOrdered function:
takeOrdered
java.util.List takeOrdered(int num, java.util.Comparator comp)
Returns the first K elements from this RDD as defined by the specified Comparator[T] and maintains the order.
Parameters:
num - the number of top elements to return
comp - the comparator that defines the order
Returns: an array of top elements
or
takeOrdered
java.util.List takeOrdered(int num)
Returns the first K elements from this RDD using the natural ordering for T while maintaining the order.
Parameters: num - the number of top elements to return
Returns: an array of top elements
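So you could do the following (count() returns a long, so it has to be cast to int, and the whole result is brought back to the driver):
List<String> sortedLines = lines.takeOrdered((int) lines.count());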
ctx.stop();
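The natural ordering compares the lines as strings. To order by the hash, the two-argument takeOrdered overload quoted above could be given a comparator. A rough sketch, assuming the hash is the numeric last comma-separated field of each line; HashComparator is a hypothetical name, not part of Spark. It implements Serializable because Spark has to ship the comparator to the executors.

```java
import java.io.Serializable;
import java.util.Comparator;

// Hypothetical helper: orders lines by the numeric value of their last
// comma-separated field (assumed to be the hash).
public class HashComparator implements Comparator<String>, Serializable {
    private static final long serialVersionUID = 1L;

    private static long hashOf(String line) {
        String[] fields = line.split(",");
        return Long.parseLong(fields[fields.length - 1].trim());
    }

    @Override
    public int compare(String a, String b) {
        // Long.compare avoids the overflow that subtracting two longs could cause.
        return Long.compare(hashOf(a), hashOf(b));
    }
}
```

It would then be used as lines.takeOrdered((int) lines.count(), new HashComparator()).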
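Since an RDD is distributed across partitions, its order is only meaningful right after an explicit sort. Any later transformation that shuffles the data (a repartition or groupByKey, for example) will not preserve that order, so it is of little use to sort early; sort as the last step instead (cmiiw).
If you want the result to stay an RDD rather than a List on the driver, take a look at JavaPairRDD.sortByKey().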
Upvotes: 2