Ged

Reputation: 18013

Spark collect side-effects not well understood

In the continual quest for knowledge and the enlightenment of others: the rdd3 statement below works without collect but fails to compile with collect. I am wondering why, because reading what collect does leaves it unclear and is easy to get confused by. The examples are contrived, so don't worry about that.

No problem:

val rdd = sc.parallelize(List((" aaa", "x"), ("bbbb ", "y"), (" cc ", "z"), ("gggg  ", " a"), ("    ", "b")))
val rdd2 = rdd.map{ case (field1, field2) => ( field1.replaceAll(" ", ""), field1.trim, field1, field2) }.collect
val rdd3 = rdd2.map{ case (field1, field2, field3, field4) => (field1.replaceAll(" ", ""), if (field1.trim == "") " "  else field1 , field3, field4) }

Problem:

val rdd3 = rdd2.map{ case (field1, field2, field3, field4) => (field1.replaceAll(" ", ""), if (field1.trim == "") " "  else field1 , field3, field4) }.collect

Returns:

notebook:7: error: missing argument list for method collect in trait TraversableLike
Unapplied methods are only converted to functions when a function type is expected.
You can make this conversion explicit by writing `collect _` or `collect(_)(_)` instead of `collect`.
val rdd3 = rdd2.map{ case (field1, field2, field3, field4) => (field1.replaceAll(" ", ""), if (field1.trim == "") " "  else field1 , field3, field4) }.collect

This is hard to follow for a novice. How should I read the error message to work out a fix?

Upvotes: 0

Views: 190

Answers (1)

Chandan Ray

Reputation: 2091

The collect() method returns the contents of an entire RDD/Dataset to the driver machine as an array.

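So in your example, the collect at the end of the rdd2 statement makes rdd2 a local Array on the driver, and you cannot use it like an RDD. Calling map on it uses the ordinary Scala collections map, and the trailing collect is then the collections method from TraversableLike, which expects a partial function as its argument. That is what the "missing argument list" error is telling you. Remove collect where you create rdd2, and the collect in your third statement should then work.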
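To make the difference concrete, here is a minimal sketch in plain Scala, with no Spark involved. The object name CollectDemo and the values local, mapped and nonEmpty are hypothetical, chosen only for illustration, and the hard-coded Array stands in for what a collect() on an RDD hands back to the driver. It shows that collect on an Array needs a partial function, which is why the bare collect in the question does not compile.

```scala
object CollectDemo {
  // Stand-in for the Array that rdd.map(...).collect returns to the driver.
  // (Hypothetical local data; Spark is not needed to see the behaviour.)
  val local: Array[(String, String)] =
    Array((" aaa", "x"), ("bbbb ", "y"), ("    ", "b"))

  // map on an Array is the ordinary Scala collections map and yields an Array.
  val mapped: Array[(String, String)] =
    local.map { case (f1, f2) => (f1.replaceAll(" ", ""), f2) }

  // mapped.collect   // does not compile: "missing argument list for method collect"

  // On a Scala collection, collect takes a PartialFunction and behaves like
  // filter + map in one pass: elements the pattern does not match are dropped.
  val nonEmpty: Array[String] =
    mapped.collect { case (f1, f2) if f1.nonEmpty => f1 + ":" + f2 }

  def main(args: Array[String]): Unit =
    println(nonEmpty.mkString(", ")) // prints "aaa:x, bbbb:y"
}
```

In the Spark version of the question, dropping the first collect keeps rdd2 as an RDD, so the final collect is the RDD method that takes no argument and returns an Array.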

Upvotes: 1
