Reputation: 71
I have a piece of Scala code that works locally:
val test = "resources/test.csv"
val trainInput = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .format("com.databricks.spark.csv")
  .load(test)
  .cache()
However, when I try to run it on Azure Spark by submitting the job and adjusting the following line:
val test = "wasb:///tmp/MachineLearningScala/test.csv"
it doesn't work. How do I reference files in blob storage on Azure using Scala? This should be straightforward.
Upvotes: 1
Views: 7039
Reputation: 23119
If you are using sbt, add this dependency to build.sbt:
libraryDependencies += "org.apache.hadoop" % "hadoop-azure" % "2.7.3"
For Maven, add the dependency as:
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-azure</artifactId>
  <version>2.7.3</version>
</dependency>
To read files from blob storage, you need to define the file system to be used in the underlying Hadoop configuration:
spark.sparkContext.hadoopConfiguration.set("fs.azure", "org.apache.hadoop.fs.azure.NativeAzureFileSystem")
spark.sparkContext.hadoopConfiguration.set("fs.azure.account.key.yourAccount.blob.core.windows.net", "yourKey ")
And read the CSV file as (use wasb:// for plain HTTP, wasbs:// for TLS; container and account names are placeholders):
val path = "wasbs://yourContainer@yourAccount.blob.core.windows.net"
val dataframe = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv(path + "/tmp/MachineLearningScala/test.csv")
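As a quick sanity check after loading, you can inspect the inferred schema and preview a few rows:
dataframe.printSchema() // verify inferSchema picked up the expected column types
dataframe.show(5)       // preview the first five rows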
Hope this helped!
Upvotes: 2