herman

Reputation: 12305

Spark: how to use SparkContext.textFile for local file system

I'm just getting started using Apache Spark (in Scala, but the language is irrelevant). I'm using standalone mode, and I want to process a text file from a local file system (so nothing distributed like HDFS).

According to the documentation of the textFile method from SparkContext, it will

Read a text file from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI, and return it as an RDD of Strings.

What is unclear to me is whether the whole text file can just be copied to all the nodes, or whether the input data should already be partitioned, e.g. with 4 nodes and a CSV file of 1000 lines, 250 lines on each node.

I suspect each node should have the whole file but I'm not sure.

Upvotes: 14

Views: 42487

Answers (6)

KayV

Reputation: 13835

Add "file:///" uri in place of "file://". This solved the issue for me.

Upvotes: 1

ketankk

Reputation: 2664

Spark 1.6.1

Java 1.7.0_99

Nodes in cluster: 3 (HDP).

Case 1: running in local mode `local[n]`

Both `file:///..` and `file:/..` read the file from the local file system.

Case 2: running with `--master yarn-cluster`

`file:/..` and `file:///..` fail with "Input path does not exist".

And `file://..` fails with:

java.lang.IllegalArgumentException: Wrong FS: file://.. expected: file:///
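A minimal sketch of Case 1 (the app name and input path are hypothetical; here the context is created explicitly instead of using spark-shell's `sc`):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Case 1: local mode, e.g. local[2] -- driver and executors share one machine,
// so both URI forms resolve to the same local file.
val conf = new SparkConf().setAppName("localFileRead").setMaster("local[2]")
val sc = new SparkContext(conf)

val a = sc.textFile("file:///tmp/input.txt") // three slashes: OK
val b = sc.textFile("file:/tmp/input.txt")   // one slash: also OK in local mode
println(s"${a.count()} lines / ${b.count()} lines")

// Case 2 (--master yarn-cluster) is chosen at spark-submit time, not in code;
// there the same URIs fail unless the file exists at that path on every node.
sc.stop()
```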

Upvotes: 1

Manu Prakash

Reputation: 15

The proper way is to use three slashes: two for the URI syntax (just like http://) and one for the mount point of the Linux file system, e.g. sc.textFile("file:///home/worker/data/my_file.txt"). If you are using local mode, then the file path alone is sufficient. In the case of a standalone cluster, the file must be copied to each node. Note that the contents of the file must be exactly the same on every node, otherwise Spark returns funny results.
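A short sketch of both cases (the path comes from the answer; `sc` is assumed to be an existing SparkContext, and the standalone-cluster line assumes an identical copy of the file on every node):

```scala
// Local mode: a bare path is sufficient.
val localRdd = sc.textFile("/home/worker/data/my_file.txt")

// Standalone cluster: use the full three-slash URI, and make sure an identical
// copy of the file exists at this exact path on the driver and on every worker.
val clusterRdd = sc.textFile("file:///home/worker/data/my_file.txt")
println(clusterRdd.count())
```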

Upvotes: 2

zhaozhi

Reputation: 1581

Prepend file:// to your local file path.

Upvotes: 4

gneets

Reputation: 19

From Spark's FAQ page, on running without Hadoop/HDFS: "if you run on a cluster, you will need some form of shared file system (for example, NFS mounted at the same path on each node). If you have this type of filesystem, you can just deploy Spark in standalone mode."

https://spark.apache.org/faq.html
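For example, with NFS mounted at the same path on every node (the mount point and file name here are hypothetical, and `sc` is assumed to be an existing SparkContext), a plain textFile call is enough:

```scala
// /mnt/nfs is mounted identically on the driver and all workers, so each
// executor can open the file locally and read its own partitions.
val events = sc.textFile("file:///mnt/nfs/data/events.csv")
println(events.count())
```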

Upvotes: 2

David Gruzman

Reputation: 8088

Each node should contain the whole file. In that case, the local file system will be logically indistinguishable from HDFS with respect to this file.

Upvotes: 10
