Reputation: 159
I am new to Spark and am trying to understand the code in my project so that I can work on it. Where the Spark session is created, I see this config entry: .config("spark.yarn.jars", "local:/cloudera/opt/xx/xxjars/*").
I don't understand the "local:/" URI scheme. What does it mean? Can someone please help?
I searched Google and found one page that mentions it as a scheme, but I couldn't find any detail on what it refers to.
Upvotes: 1
Views: 1453
Reputation: 518
As I understand it, "local:/path/to/file" means that the file is expected to be on the local filesystem of each worker node, as opposed to HDFS, for example (hdfs:///path/to/file).
So in the former case the file has to reside on each node's own filesystem; in the latter case it is enough for it to be somewhere in HDFS, and it will be downloaded to the nodes when the Spark context starts up.
The behaviour is explained in the Spark Documentation:
Spark uses the following URL scheme to allow different strategies for disseminating jars:
- file: - Absolute paths and file:/ URIs are served by the driver’s HTTP file server, and every executor pulls the file from the driver HTTP server.
- hdfs:, http:, https:, ftp: - these pull down files and JARs from the URI as expected
- local: - a URI starting with local:/ is expected to exist as a local file on each worker node. This means that no network IO will be incurred, and works well for large files/JARs that are pushed to each worker, or shared via NFS, GlusterFS, etc.
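To make the list above concrete, here is a minimal Python sketch that pulls the scheme out of a jar URI and looks up the matching strategy. It is not Spark code: the `STRATEGIES` table and the `describe_jar_uri` helper are hypothetical names invented for illustration, and the summaries only paraphrase the documentation excerpt quoted above.

```python
from urllib.parse import urlparse

# Hypothetical lookup table; the wording paraphrases the quoted documentation.
STRATEGIES = {
    "file": "served by the driver's HTTP file server; executors pull it from the driver",
    "hdfs": "pulled down from the URI",
    "http": "pulled down from the URI",
    "https": "pulled down from the URI",
    "ftp": "pulled down from the URI",
    "local": "expected to already exist on each worker node; no network IO",
}


def describe_jar_uri(uri: str) -> str:
    """Return a short description of how a jar URI would be distributed."""
    # A bare absolute path has no scheme; the quoted docs group it with file:.
    scheme = urlparse(uri).scheme or "file"
    return STRATEGIES.get(scheme, "not covered by the quoted documentation")


if __name__ == "__main__":
    for uri in (
        "local:/cloudera/opt/xx/xxjars/*",
        "hdfs:///path/to/file",
        "/path/to/file",
    ):
        print(uri, "->", describe_jar_uri(uri))
```

For the value in the question, the scheme is `local`, so the jars under /cloudera/opt/xx/xxjars/ are expected to be present on every node already rather than being shipped at startup.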
For large files it is better to use the local: scheme, or to keep them in HDFS with the replication factor set to the number of nodes, so that a replica of the file is always on the same node your container is running on.
Upvotes: 1