sunitha

Reputation: 1538

Using java collections in spark programs

I have a doubt about using Java collections in Spark programs. I came across the following in the Spark programming guide:

The first way to reduce memory consumption is to avoid the Java features that add overhead, such as pointer-based data structures and wrapper objects. There are several ways to do this:

Design your data structures to prefer arrays of objects, and primitive types, instead of the standard Java or Scala collection classes (e.g. HashMap). The fastutil library provides convenient collection classes for primitive types that are compatible with the Java standard library.

Does this mean we shouldn't use Java collections and should go for arrays of objects instead? Is the following code fine?

Map<String, String> lookUpMap = getLkp(path);
final Broadcast<<Map<String, String>> lookupBrdcst = sparkContext.broadcast(lookUpMap);

Upvotes: 2

Views: 731

Answers (1)

Binary Nerd

Reputation: 13902

This is fine, assuming the HashMap isn't too large. If it gets large, you would probably want to use a join instead, so the lookup data is distributed across the cluster rather than shipped whole to every executor.
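As a minimal sketch of the join alternative (all names and the local context are illustrative, not from the original code): the lookup data becomes a pair RDD instead of a driver-side HashMap, and a `join` matches it against the data by key.

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class JoinLookup {
    public static void main(String[] args) {
        // Illustrative local context; in a real job this comes from your driver setup
        SparkConf conf = new SparkConf().setMaster("local[*]").setAppName("join-lookup");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // The large lookup data as a pair RDD, not a driver-side Map
        JavaPairRDD<String, String> lookup = sc.parallelizePairs(Arrays.asList(
                new Tuple2<>("k1", "v1"),
                new Tuple2<>("k2", "v2")));

        // The data to enrich, keyed the same way
        JavaPairRDD<String, String> data = sc.parallelizePairs(Arrays.asList(
                new Tuple2<>("k1", "rowA"),
                new Tuple2<>("k2", "rowB")));

        // join shuffles both sides by key, so no single copy of the lookup
        // table ever has to fit on the driver or be broadcast to every executor
        JavaPairRDD<String, Tuple2<String, String>> joined = data.join(lookup);
        joined.collect().forEach(System.out::println);

        sc.stop();
    }
}
```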

Your code does have a slight syntax error:

final Broadcast<<Map<String, String>> lookupBrdcst = sparkContext.broadcast(lookUpMap);

should be:

final Broadcast<Map<String, String>> lookupBrdcst = sparkContext.broadcast(lookUpMap);

You can see Java collections used as broadcast variables in the Spark examples themselves:

https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/streaming/JavaRecoverableNetworkWordCount.java

This example uses List<String> as a broadcast variable.
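Putting it together, here is a hedged sketch of the pattern from the question: broadcast a small `Map` once and read it inside a transformation. The local context and `getOrDefault` fallback are illustrative; the map stands in for the question's `getLkp(path)`.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.broadcast.Broadcast;

public class BroadcastLookup {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setMaster("local[*]").setAppName("broadcast-lookup");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Small driver-side lookup table (stands in for getLkp(path))
        Map<String, String> lookUpMap = new HashMap<>();
        lookUpMap.put("k1", "v1");
        lookUpMap.put("k2", "v2");

        // Shipped once per executor instead of being serialized into every task
        final Broadcast<Map<String, String>> lookupBrdcst = sc.broadcast(lookUpMap);

        JavaRDD<String> keys = sc.parallelize(Arrays.asList("k1", "k2"));
        // Read (never mutate) the broadcast value inside the transformation
        JavaRDD<String> values = keys.map(k -> lookupBrdcst.value().getOrDefault(k, "missing"));

        System.out.println(values.collect());
        sc.stop();
    }
}
```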

Upvotes: 1
