GameOfThrows

Reputation: 4510

Spark distribute local file from master to nodes

I used to run Spark locally, and distributing files to the nodes never caused me problems, but now that I am moving things to Amazon's cluster service, things start to break down. Basically, I am processing some IP addresses using the Maxmind GeoLiteCity.dat, which I placed on the local file system of the master (file:///home/hadoop/GeoLiteCity.dat).

Following an earlier question, I used sc.addFile:

sc.addFile("file:///home/hadoop/GeoLiteCity.dat")

and refer to it using something like:

val ipLookups = IpLookups(geoFile = Some(SparkFiles.get("GeoLiteCity.dat")), memCache = false, lruCache = 20000)

This works when running locally on my computer, but it seems to fail on the cluster. I do not know the reason for the failure, and I would appreciate it if someone could tell me how to display the logs for the process; the logs generated by the Amazon service do not contain any information on which step is failing.
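For reference, the complete pattern I am using looks roughly like this (ipAddresses stands in for my actual RDD of IP strings; performLookups is the lookup call from the scala-maxmind-iplookups library that IpLookups comes from):

import org.apache.spark.SparkFiles

// Register the file on the driver so Spark ships it to every executor.
sc.addFile("file:///home/hadoop/GeoLiteCity.dat")

// SparkFiles.get is called inside the closure, so it resolves the copy
// of the file on each executor rather than a path on the driver.
val located = ipAddresses.mapPartitions { ips =>
  val ipLookups = IpLookups(
    geoFile  = Some(SparkFiles.get("GeoLiteCity.dat")),
    memCache = false,
    lruCache = 20000
  )
  ips.map(ip => (ip, ipLookups.performLookups(ip)))
}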

Do I have to somehow load GeoLiteCity.dat onto HDFS? Are there other ways to distribute a local file from the master to the nodes without HDFS?

EDIT: Just to clarify how I run this: I wrote a JSON file that defines multiple steps; the first step runs a bash script that transfers GeoLiteCity.dat from Amazon S3 to the master:

#!/bin/bash
cd /home/hadoop
aws s3 cp s3://test/GeoLiteCity.dat GeoLiteCity.dat

After checking that the file is in the directory, the JSON then executes the Spark jar, but this fails. The logs shown in the Amazon web UI do not reveal where the code breaks.

Upvotes: 4

Views: 2868

Answers (1)

sag

Reputation: 5461

Instead of copying the file onto the master, load the file into S3 and read it from there.

Refer to http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html for reading files from S3.

You need to provide your AWS Access Key ID and Secret Key. Either set the environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY, or set them programmatically like:

sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", YOUR_ACCESS_KEY)
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", YOUR_SECRET_KEY)
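For example, a minimal sketch of doing this right after the context is created (the app name and the credential values are placeholders to substitute with your own):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("GeoIPLookup")
val sc = new SparkContext(conf)

// Placeholder credentials -- substitute your own, or read them from a
// secure source instead of hard-coding them in the jar.
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY")
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY")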

Then you can just read the file as a text file, like:

sc.textFile("s3n://test/GeoLiteCity.dat")
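As a quick sanity check, something like the following sketch (assuming the sc configured above and the bucket/key from the question; s3n:// access also needs the Hadoop S3 connector classes on the classpath):

// Count the records just to confirm the executors can reach the bucket;
// for a binary .dat file the count itself is not meaningful.
val geoData = sc.textFile("s3n://test/GeoLiteCity.dat")
println(s"Read ${geoData.count()} records from s3n://test/GeoLiteCity.dat")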

Additional reference: How to read input from S3 in a Spark Streaming EC2 cluster application: https://stackoverflow.com/a/30852341/4057655

Upvotes: 1
