deepdive

Reputation: 10962

spark-submit dependency resolution for spark-csv

I am writing a small Scala program that converts CSV to Parquet, using the Databricks spark-csv library. Here is my build.sbt:

name := "tst"

version := "1.0"

scalaVersion := "2.10.5"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.6.1" % "provided",
  "org.apache.spark" %% "spark-sql" % "1.6.1",
  "com.databricks" % "spark-csv_2.10" % "1.5.0",
  "org.apache.spark" %% "spark-hive" % "1.6.1",
  "org.apache.commons" % "commons-csv" % "1.1",
  "com.univocity" % "univocity-parsers" % "1.5.1",
  "org.slf4j" % "slf4j-api" % "1.7.5" % "provided",
  "org.scalatest" %% "scalatest" % "2.2.1" % "test",
  "com.novocode" % "junit-interface" % "0.9" % "test",
  "com.typesafe.akka" % "akka-actor_2.10" % "2.3.11",
  "org.scalatest" %% "scalatest" % "2.2.1",
  "com.holdenkarau" %% "spark-testing-base" % "1.6.1_0.3.3",
  "com.databricks" % "spark-csv_2.10" % "1.5.0",
  "org.joda" % "joda-convert" % "1.8.1"
)

After sbt package, when I run the command

spark-submit --master local[*] target/scala-2.10/tst_2.10-1.0.jar

I get the following error.

Exception in thread "main" java.lang.ClassNotFoundException: Failed to find data source: com.databricks.spark.csv. Please find packages at http://spark-packages.org

I can see the com.databricks_spark-csv_2.10-1.5.0.jar file in ~/.ivy2/jars/, downloaded by the sbt package command.

Here is the source code of dataconversion.scala:

import org.apache.spark.sql.SQLContext
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

object dataconversion {
  def main(args: Array[String]) {
    val conf =
      new SparkConf()
      .setAppName("ClusterScore")
      .set("spark.storage.memoryFraction", "1")

    val sc = new SparkContext(conf)
    val sqlc = new SQLContext(sc)
    val df = sqlc.read
      .format("com.databricks.spark.csv")
      .option("header", "true") // Use first line of all files as header
      .option("inferSchema", "true") // Automatically infer data types
      .load("/tmp/cars.csv")
    df.printSchema() // printSchema returns Unit, so wrapping it in println just prints "()"
  }

}

I can run spark-submit without error if I specify the --jars option with an explicit jar path, but that's not ideal. Please suggest a better approach.

Upvotes: 0

Views: 941

Answers (2)

Vidya

Reputation: 30310

Use the sbt-assembly plugin to build a "fat jar" containing all your dependencies with sbt assembly, and then call spark-submit on that.
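A minimal sketch of that setup, assuming sbt 0.13 with sbt-assembly 0.14.x (check the plugin's README for the version matching your sbt release). Keep the Spark artifacts marked "provided" in libraryDependencies so they stay out of the fat jar:

// project/plugins.sbt -- plugin version is an assumption, adjust to your sbt release
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.3")

// build.sbt -- resolve duplicate META-INF entries pulled in by transitive dependencies
assemblyMergeStrategy in assembly := {
  case PathList("META-INF", xs @ _*) => MergeStrategy.discard
  case _                             => MergeStrategy.first
}

Then run sbt assembly and pass the resulting assembly jar (by default target/scala-2.10/tst-assembly-1.0.jar) to spark-submit.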

In general, when you get ClassNotFoundException, try exploding the jar you created to see what's in it with jar tvf target/scala-2.10/tst_2.10-1.0.jar. Checking what's in the Ivy cache is meaningless; that just tells you that SBT found it. As mathematicians say, that's necessary but not sufficient.
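For instance, to check whether the spark-csv classes actually made it into your jar (with plain sbt package they won't be there, which is exactly why the data source isn't found):

jar tvf target/scala-2.10/tst_2.10-1.0.jar | grep com/databricks/spark/csv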

Upvotes: 1

FaigB

Reputation: 2281

The mentioned library is required at runtime, so you have these options:

  1. Place com.databricks_spark-csv_2.10-1.5.0.jar in a local or HDFS-reachable path and provide it as a dependency with the --jars parameter
  2. Use --packages com.databricks:spark-csv_2.10:1.5.0, which makes Spark resolve the required library and provide it to your process (see the example command after this list)
  3. Build a fat jar with your dependencies and forget about --jars
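For example, option 2 might look like this (a sketch reusing the jar path from the question); Spark resolves the package and its transitive dependencies and puts them on the driver and executor classpaths:

spark-submit --master local[*] \
  --packages com.databricks:spark-csv_2.10:1.5.0 \
  target/scala-2.10/tst_2.10-1.0.jar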

Upvotes: 0
