JetS79

Reputation: 71

How to read a file from Blob storage in Spark using Scala

I have a piece of Scala code that works locally:

val test = "resources/test.csv"

val trainInput = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .format("com.databricks.spark.csv")
  .load(train)
  .cache

However, when I try to run it on Azure Spark by submitting the job and adjusting the following line:

val test = "wasb:///tmp/MachineLearningScala/test.csv"

It doesn't work. How do I reference files in Blob storage on Azure using Scala? This should be straightforward.

Upvotes: 1

Views: 7039

Answers (1)

koiralo

Reputation: 23119

If you are using sbt, add this dependency to build.sbt:

libraryDependencies += "org.apache.hadoop" % "hadoop-azure" % "2.7.3"

For Maven, add the dependency as:

<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-azure</artifactId>
    <version>2.7.3</version>
</dependency>

To read files from Blob storage, you need to register the Azure file system and your storage account key in the underlying Hadoop configuration:

spark.sparkContext.hadoopConfiguration.set("fs.azure", "org.apache.hadoop.fs.azure.NativeAzureFileSystem")
spark.sparkContext.hadoopConfiguration.set("fs.azure.account.key.yourAccount.blob.core.windows.net", "yourKey ")

And read the CSV file as:

  val path = "wasb[s]://[email protected]"
  val dataframe = spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv(path + "/tmp/MachineLearningScala/test.csv")
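
As a quick sanity check (a sketch; the actual columns depend on your CSV), you can confirm the load succeeded:

dataframe.printSchema() // schema inferred from the header row
dataframe.show(5)       // preview the first five rows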

Hope this helped!

Upvotes: 2
