user3796699
user3796699

Reputation: 11

spark0.9.1 foreach println not work

I use spark-assembly_2.10-0.9.0-cdh5.0.2-hadoop2.3.0-cdh5.0.2.jar and write an application, like this:

JavaRDD<String> lines = sc.textFile(fdfileString);
    System.out.println("line: " + lines.count());
    JavaRDD<String> l2 = lines.map(new Function<String, String>() {

        @Override
        public String call(String arg0) throws Exception {
            System.out.println(arg0);
            TestUsable.count++;
            return arg0;
        }

    });
    System.out.println("finished");
    System.out.println(TestUsable.count);

run command use java: java -classpath ${spark_lib} it works but does not output content:

line: 84498
finished
0

can anyone help me.thanks

Upvotes: 1

Views: 2991

Answers (1)

samthebest
samthebest

Reputation: 31533

Looks like your not really understanding the way distributed functional programming works. You cannot mutate static variables! That's why your count will be 0. And you cannot run printlns inside map operations on RDDs - they will print the value to the logs on whatever node that data happens to be on.

Next, it looks like your understanding the lazy execution paradigm of Spark - Spark won't execute code unless it needs to. In your case it doesn't need to execute the map because you haven't asked it to do anything with the result. You need to call an action: http://spark.apache.org/docs/latest/programming-guide.html#actions

I think in your case you want to use the foreach action.

Another point is that you will ultimately iterate over the data twice (once to count, once to println on the local node), so you probably want to just toArray it and process it locally (which kinda makes using Spark pointless). What exactly do you want to do with your data?? If your more specific then I can explain more about how to do that in a distributed FP way.

Also strongly recomend you do it Scala, when programs start to get complicated it's extremly verbose doing it in Java. In Scala, to do what you want is just this:

lines = sc.textFile(fdfileString)
println("lines count = " + lines)
lines.foreach(println)
println("finished")

Upvotes: 3

Related Questions