Reputation: 3172
I'm a newbie in Spark. I'm trying to read a text file and sum up the total of the third column. I'm a bit confused about how to do it with RDDs.
import java.util.Arrays;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;

public class test2 {
    public static void main(String[] args) {
        String logFile = "textfile.txt"; // Should be some file on your system
        JavaSparkContext sc = new JavaSparkContext("local", "Simple App",
                "spark-0.9.1-bin-hadoop2/", new String[]{"test2_jar/test2.jar"});
        JavaRDD<String> logData = sc.textFile(logFile).cache();
        // flatMap emits every comma-separated field as its own RDD element
        JavaRDD<String> tabbed = logData.flatMap(new FlatMapFunction<String, String>() {
            @Override
            public Iterable<String> call(String s) throws Exception {
                return Arrays.asList(s.split(","));
            }
        });
    }
}
This is as far as I've got. How do I get the RDD to access the second column after I split it? I know the summation can be done using fold, but I'm not really sure how to do it.
Upvotes: 0
Views: 1914
Reputation: 3261
It is a bit easier to understand what is going on using the spark-shell and Scala, since the syntax is a bit less verbose. Then, once you understand the flow, writing it in Java is much easier.
First: flatMap will split each log entry and flatten all of the resulting fields into the RDD as individual elements, so instead of having two rows like
A, B
C, D
you will end up with four rows, like
A
B
C
D
To get the behavior you want (one array of fields per row), you need to use the map function instead.
In the spark-shell, the code would look like this:
val logData = sc.textFile("textfile.txt")
val tabbed = logData.map(x => x.split(","))        // one Array[String] per line
val secondColumn = tabbed.map(x => x(1).toDouble)  // index 1 is the second column; use x(2) for the third
val sum = secondColumn.sum
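For completeness, here is roughly how the same flow could look in Java against the 0.9.x API used in the question. This is an untested sketch: the class name SumColumn and the app name are made up, it assumes comma-separated numeric data, and it uses fold for the summation since the question mentioned it.
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.Function2;

public class SumColumn {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext("local", "Sum Column");
        JavaRDD<String> logData = sc.textFile("textfile.txt");

        // map (not flatMap): one Double per line, parsed from the second column
        JavaRDD<Double> secondColumn = logData.map(new Function<String, Double>() {
            @Override
            public Double call(String s) throws Exception {
                return Double.parseDouble(s.split(",")[1]);
            }
        });

        // fold with a zero value of 0.0 and addition as the combining function
        Double sum = secondColumn.fold(0.0, new Function2<Double, Double, Double>() {
            @Override
            public Double call(Double a, Double b) throws Exception {
                return a + b;
            }
        });

        System.out.println("sum = " + sum);
    }
}
reduce would work just as well as fold here, and depending on your Spark version, mapping through a DoubleFunction instead gives you a JavaDoubleRDD whose built-in sum() mirrors the Scala version above.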
Upvotes: 2