Little Bobby Tables

Reputation: 5351

Read files sent with spark-submit by the driver

I am sending a Spark job to run on a remote cluster by running

spark-submit ... --deploy-mode cluster --files some.properties ...

I want to read the content of the some.properties file in the driver code, i.e. before creating the Spark context and launching RDD tasks. The file is copied to the remote driver, but not to the driver's working directory.

The ways around this problem that I know of are:

  1. Upload the file to HDFS
  2. Store the file in the app jar

Both are inconvenient since this file is frequently changed on the submitting dev machine.

Is there a way to read the file that was uploaded using the --files flag in the driver's main method?

Upvotes: 46

Views: 110856

Answers (8)

Vladimir Shadrin

Reputation: 376

I was stuck with the same question and did it as described here: How to load local csv file in spark via --files option

val from = new File(file).getAbsoluteFile.toPath
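The linked question resolves the file through SparkFiles once a SparkContext exists. A minimal Scala sketch of that idea (using the some.properties name from the question; adapt to your own file) might look like:

import java.io.FileInputStream
import java.util.Properties
import org.apache.spark.{SparkContext, SparkFiles}

// SparkFiles.get resolves the local path of a file shipped with --files,
// but only after a SparkContext has been created.
val sc = SparkContext.getOrCreate()
val props = new Properties()
props.load(new FileInputStream(SparkFiles.get("some.properties")))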

Upvotes: 0

minisheep

Reputation: 111

The --files and --archives options support specifying file names with the # symbol, just like Hadoop does.

For example, you can specify --files localtest.txt#appSees.txt: this uploads the file you have locally named localtest.txt into the Spark worker directory, but it will be linked to by the name appSees.txt, so your application should use the name appSees.txt to reference it when running on YARN.

This works for my Spark Streaming application in both yarn/client and yarn/cluster mode.
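For illustration, a submit command using that alias (file names and paths here are only examples) might look like:

spark-submit --master yarn --deploy-mode cluster --files /local/path/localtest.txt#appSees.txt app.jar

and code running inside a YARN container would then refer to the file by the alias, e.g. in Scala:

// appSees.txt, not localtest.txt, is the name that exists in the container's working directory
val lines = scala.io.Source.fromFile("appSees.txt").getLines().toList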

Upvotes: 8

kuixiong

Reputation: 523

In PySpark I found it surprisingly easy to achieve this. First, arrange your working directory like this:

/path/to/your/workdir/
|--code.py
|--file.txt

and then in your code.py main function, just read the file as usual:

if __name__ == "__main__":
    content = open("./file.txt").read()

then submit it without any specific configurations as follows:

spark-submit code.py

It runs correctly, which amazed me. I suppose that in PySpark the submit process archives the files and subdirectory files together and sends them to the driver, whereas in the Scala version you have to archive them yourself. By the way, both the --files and --archives options place files on the workers, not the driver, which means you can only access these files in RDD transformations or actions.

Upvotes: 1

NicolasLi

Reputation: 117

Run spark-submit --help and you will find that this option is only for the working directory of the executors, not the driver:

--files FILES: Comma-separated list of files to be placed in the working directory of each executor.

Upvotes: 6

Prashant Sahoo

Reputation: 1095

After some investigation I found one solution for the above issue: pass the any.properties configuration file during spark-submit and use it in the Spark driver before and after SparkSession initialization. Hope it helps.

any.properties

spark.key=value
spark.app.name=MyApp

SparkTest.java

import java.io.File;

import com.typesafe.config.Config;
import com.typesafe.config.ConfigFactory;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.SparkSession;

public class SparkTest {

  public static void main(String[] args) {

    String warehouseLocation = new File("spark-warehouse").getAbsolutePath();

    Config conf = loadConf();
    System.out.println(conf.getString("spark.key"));

    // Initialize SparkContext and use configuration from the properties file
    SparkConf sparkConf = new SparkConf(true).setAppName(conf.getString("spark.app.name"));

    SparkSession sparkSession =
        SparkSession.builder().config(sparkConf).config("spark.sql.warehouse.dir", warehouseLocation)
                    .enableHiveSupport().getOrCreate();

    JavaSparkContext javaSparkContext = new JavaSparkContext(sparkSession.sparkContext());
  }

  public static Config loadConf() {

    String configFileName = "any.properties";
    System.out.println(configFileName);

    // Parse the properties file from the driver's working directory
    Config configs = ConfigFactory.load(ConfigFactory.parseFile(new File(configFileName)));
    System.out.println(configs.getString("spark.key")); // get value from the properties file
    return configs;
  }
}

Spark Submit:

spark-submit --class SparkTest --master yarn --deploy-mode client --files any.properties,yy-site.xml --jars ...........

Upvotes: 6

Jugraj Singh

Reputation: 549

A way around the problem is to create a temporary SparkContext by calling SparkContext.getOrCreate() and then read the file you passed with --files using SparkFiles.get('FILE').

Once you have read the file, retrieve all the configuration you need into a SparkConf() variable.

After that call this function:

SparkContext.stop(SparkContext.getOrCreate())

This will destroy the existing SparkContext, and then on the next line you can simply initialize a new SparkContext with the necessary configuration:

sc = SparkContext.getOrCreate(conf=conf)

Now you have a SparkContext with the desired settings.
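The snippets above are PySpark-style; a rough Scala sketch of the same idea (the file name and property key are made up for illustration) could look like this:

import java.io.FileInputStream
import java.util.Properties
import org.apache.spark.{SparkConf, SparkContext, SparkFiles}

// 1. Temporary context, used only to locate the file shipped with --files
val tempSc = SparkContext.getOrCreate()
val props = new Properties()
props.load(new FileInputStream(SparkFiles.get("some.properties")))
tempSc.stop()

// 2. Rebuild the context with settings read from the file
val conf = new SparkConf().setAppName(props.getProperty("spark.app.name"))
val sc = SparkContext.getOrCreate(conf)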

Upvotes: 0

prossblad

Reputation: 944

Here's a solution I developed in PySpark to bring any outside data into your big data platform as a file.

Have fun.

# Load any local text file from the Spark driver and return an RDD
# (really useful in YARN mode to integrate new data on the fly)
# (See https://community.hortonworks.com/questions/38482/loading-local-file-to-apache-spark.html)
def parallelizeTextFileToRDD(sparkContext, localTextFilePath, splitChar):
    localTextFilePath = localTextFilePath.strip(' ')
    if localTextFilePath.startswith("file://"):
        localTextFilePath = localTextFilePath[7:]
    import subprocess
    # Read the file on the driver; decode so split() works on str under Python 3
    data = subprocess.check_output("cat " + localTextFilePath, shell=True).decode("utf-8")
    textRDD = sparkContext.parallelize(data.split(splitChar))
    return textRDD

# Usage example
myRDD = parallelizeTextFileToRDD(sc, '~/myTextFile.txt', '\n')  # Load my local file as an RDD
myRDD.saveAsTextFile('/user/foo/myTextFile')  # Store my data in HDFS

Upvotes: 0

Ton Torres

Reputation: 1519

Yes, you can access files uploaded via the --files argument.

This is how I'm able to access files passed in via --files:

./bin/spark-submit \
--class com.MyClass \
--master yarn-cluster \
--files /path/to/some/file.ext \
--jars lib/datanucleus-api-jdo-3.2.6.jar,lib/datanucleus-rdbms-3.2.9.jar,lib/datanucleus-core-3.2.10.jar \
/path/to/app.jar file.ext

and in my Spark code:

import scala.io.Source

val filename = args(0)
val linecount = Source.fromFile(filename).getLines.size

I believe these files are downloaded onto the workers into the same directory where the jar is placed, which is why simply passing the file name, rather than the absolute path, to Source.fromFile works.
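Since the question is specifically about a properties file, a small variation of the above (just a sketch, assuming some.properties is passed via --files and its name arrives as args(0)) would be to load it into java.util.Properties in the driver:

import java.io.FileInputStream
import java.util.Properties

// In yarn-cluster mode the file shipped with --files lands in the driver
// container's working directory, so the bare file name is enough.
val props = new Properties()
props.load(new FileInputStream(args(0)))
val someValue = props.getProperty("some.key") // "some.key" is only an example key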

Upvotes: 34
