user2773013

Reputation: 3172

Adding up a column of a text file using a Java RDD in Spark

I'm a newbie in Spark. I'm trying to read a text file and sum up the total of the third column. I'm a bit confused about how to do it with an RDD.

import java.util.Arrays;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;

public class test2 {
  public static void main(String[] args) {
     String logFile = "textfile.txt"; // Should be some file on your system

     JavaSparkContext sc = new JavaSparkContext("local", "Simple App",
            "spark-0.9.1-bin-hadoop2/", new String[]{"test2_jar/test2.jar"});
     JavaRDD<String> logData = sc.textFile(logFile).cache();
     JavaRDD<String> tabbed = logData.flatMap(new FlatMapFunction<String, String>() {
        @Override
        public Iterable<String> call(String s) throws Exception {
            return Arrays.asList(s.split(","));
        }
     });
  }
}

This is as far as I get. How do I get the RDD to access the second column after I split it? I know the summation can be done using fold, but I'm not really sure how to do it.

Upvotes: 0

Views: 1914

Answers (1)

David

Reputation: 3261

It is a bit easier to understand what is going on using the spark-shell and Scala, since the syntax is a bit less verbose. Then, once you understand the flow, writing it in Java is much easier.

First: flatMap will split each row and append every resulting token to the RDD as its own element, so instead of having two rows like

A, B
C, D

you will end up with four rows, like

A
B
C
D

To get the behavior you want (one element per row), you need to use the 'map' function.
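The same flatMap-vs-map distinction exists in plain Java streams, which makes it easy to see outside of Spark. A minimal sketch (the data is made up for illustration; Spark's `JavaRDD.flatMap`/`map` behave analogously on distributed data):

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class MapVsFlatMap {
    public static void main(String[] args) {
        List<String> rows = Arrays.asList("A,B", "C,D");

        // flatMap flattens: every token from every row becomes its own element
        List<String> flattened = rows.stream()
                .flatMap(s -> Arrays.stream(s.split(",")))
                .collect(Collectors.toList());
        System.out.println(flattened); // [A, B, C, D]

        // map keeps one element per row: each row becomes a single String[]
        List<String[]> mapped = rows.stream()
                .map(s -> s.split(","))
                .collect(Collectors.toList());
        System.out.println(mapped.size()); // 2
    }
}
```

With map, each element is still a whole split row, so you can index into it to pick out a column; with flatMap, the row boundaries are gone.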

In the spark-shell, the code would look like this:

val logData = sc.textFile("textfile.txt")
val tabbed = logData.map(x => x.split(","))
val secondColumn = tabbed.map(x => x(1).toDouble)
val sum = secondColumn.sum
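Translated back to Java, the same pipeline can be sketched with `java.util.stream` standing in for the RDD (the sample lines are made up for illustration; in Spark's Java API the equivalent shape would be mapping each line to a double and summing, e.g. via `JavaDoubleRDD` — check the API for your Spark version):

```java
import java.util.Arrays;
import java.util.List;

public class ColumnSum {
    public static void main(String[] args) {
        // Stand-in for sc.textFile("textfile.txt"): one String per line
        List<String> logData = Arrays.asList("a,1.5,x", "b,2.5,y");

        // Split each line, parse the second field as a double, then sum
        double sum = logData.stream()
                .mapToDouble(s -> Double.parseDouble(s.split(",")[1]))
                .sum();
        System.out.println(sum); // 4.0
    }
}
```

To sum the third column instead, index with `[2]` rather than `[1]`.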

Upvotes: 2
