Reputation: 3172
I'm a newbie in Spark. I'm trying to read a text file and sum up the total of the third column. I'm a bit confused about how to do it with RDDs.
import java.util.Arrays;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;

public class test2 {
    public static void main(String[] args) {
        String logFile = "textfile.txt"; // Should be some file on your system
        JavaSparkContext sc = new JavaSparkContext("local", "Simple App",
                "spark-0.9.1-bin-hadoop2/", new String[]{"test2_jar/test2.jar"});
        JavaRDD<String> logData = sc.textFile(logFile).cache();
        // flatMap emits every comma-separated field as its own RDD element
        JavaRDD<String> tabbed = logData.flatMap(new FlatMapFunction<String, String>() {
            @Override
            public Iterable<String> call(String s) throws Exception {
                return Arrays.asList(s.split(","));
            }
        });
    }
}
This is as far as I've got. How do I get the RDD to access the second column after I split it? I know the summation can be done using fold, but I'm not really sure how to do it.
Upvotes: 0
Views: 1914
Reputation: 3261
It is a bit easier to understand what is going on using the spark-shell and Scala, since the syntax is a bit less verbose. Then, once you understand the flow, writing it in Java is much easier.
First: flatMap will split each log entry and flatten all of the resulting fields into the RDD as individual elements, so instead of having two rows like
A, B
C, D
you will end up with four rows, like
A
B
C
D
To get the behavior you want (one array of fields per row), you need to use the map function instead.
In the spark-shell, the code would look like this:
val logData = sc.textFile("textfile.txt")
val tabbed = logData.map(x => x.split(","))        // one Array[String] per line
val secondColumn = tabbed.map(x => x(1).toDouble)  // index 1 is the second column; use x(2) for the third
val sum = secondColumn.sum
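For completeness, here is roughly how the same flow could look in Java against the 0.9.x API used in the question. This is an untested sketch: the class name SumColumn and the app name are made up, it assumes comma-separated numeric data, and it uses fold for the summation since the question mentioned it.
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.Function2;

public class SumColumn {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext("local", "Sum Column");
        JavaRDD<String> logData = sc.textFile("textfile.txt");

        // map (not flatMap): one Double per line, parsed from the second column
        JavaRDD<Double> secondColumn = logData.map(new Function<String, Double>() {
            @Override
            public Double call(String s) throws Exception {
                return Double.parseDouble(s.split(",")[1]);
            }
        });

        // fold with a zero value of 0.0 and addition as the combining function
        Double sum = secondColumn.fold(0.0, new Function2<Double, Double, Double>() {
            @Override
            public Double call(Double a, Double b) throws Exception {
                return a + b;
            }
        });

        System.out.println("sum = " + sum);
    }
}
reduce would work just as well as fold here, and depending on your Spark version, mapping through a DoubleFunction instead gives you a JavaDoubleRDD whose built-in sum() mirrors the Scala version above.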
Upvotes: 2