JetS79

Reputation: 71

How to read a file from Blob storage in Spark using Scala

I have a piece of Scala code that works locally:

val test = "resources/test.csv"

val trainInput = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .format("com.databricks.spark.csv")
  .load(train)
  .cache

However, when I try to run it on Azure Spark by submitting the job and adjusting the following line:

val test = "wasb:///tmp/MachineLearningScala/test.csv"

It doesn't work. How do I reference files in Blob storage on Azure using Scala? This should be straightforward.

Upvotes: 1

Views: 7039

Answers (1)

koiralo

Reputation: 23119

If you are using sbt, add this dependency to build.sbt:

libraryDependencies += "org.apache.hadoop" % "hadoop-azure" % "2.7.3"

For Maven, add the dependency as:

<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-azure</artifactId>
    <version>2.7.3</version>
</dependency>

To read files from Blob storage, you need to register the Azure file system and your storage account key in the underlying Hadoop configuration:

spark.sparkContext.hadoopConfiguration.set("fs.azure", "org.apache.hadoop.fs.azure.NativeAzureFileSystem")
spark.sparkContext.hadoopConfiguration.set("fs.azure.account.key.yourAccount.blob.core.windows.net", "yourKey ")

And read the CSV file as:

  val path = "wasb[s]://[email protected]"
  val dataframe = spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv(path + "/tmp/MachineLearningScala/test.csv")
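
As a quick sanity check (a sketch; the actual columns depend on your CSV), you can confirm the load succeeded:

dataframe.printSchema() // schema inferred from the header row
dataframe.show(5)       // preview the first five rows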

Hope this helped!

Upvotes: 2
