Mikel Urkia
Mikel Urkia

Reputation: 2095

Hadoop DistributedCache functionality in Spark

I am looking for a functionality similar to the distributed cache of Hadoop in Spark. I need a relatively small data file (with some index values) to be present in all nodes in order to make some calculations. Is there any approach that makes this possible in Spark?

My workaround so far consists on distributing and reducing the index file as a normal processing, which takes around 10 seconds in my application. After that, I persist the file indicating it as a broadcast variable, as follows:

JavaRDD<String> indexFile = ctx.textFile("s3n://mybucket/input/indexFile.txt",1);
ArrayList<String> localIndex = (ArrayList<String>) indexFile.collect();    

final Broadcast<ArrayList<String>> globalIndex = ctx.broadcast(indexVar);

This makes the program able to understand what the variable globalIndex contains. So far it is a patch that might be okay for me, but I consider it is not the best solution. Would it still be effective with a considerably bigger data-set or a big amount of variables?

Note: I am using Spark 1.0.0 running on a Standalone cluster located at several EC2 instances.

Upvotes: 6

Views: 4934

Answers (2)

Sai Krishna
Sai Krishna

Reputation: 624

Please have a look at SparkContext.addFile() method. Guess that is what you were looking for.

Upvotes: 6

sras
sras

Reputation: 838

As long as we use Broadcast variables, it should be effective with larger dataset as well.

From the Spark documentation "Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks. They can be used, for example, to give every node a copy of a large input dataset in an efficient manner."

Upvotes: 0

Related Questions