Jake Henningsgaard

Reputation: 702

Loading CSV in Spark

I'm working through the Kaggle Titanic example using Spark ML and Scala. When I try to load the first training file, I run into a strange error:

java.io.IOException: Could not read footer: java.lang.RuntimeException: file:/Users/jake/Development/titanicExample/src/main/resources/data/titanic/train.csv is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [44, 81, 13, 10]

The file is a .csv, so I'm not sure why it's expecting a Parquet file.

Here is my code:

object App {

  val spark = SparkSession
    .builder()
    .master("local[*]")
    .appName("liveOrDie")
    .getOrCreate()

  def main(args: Array[String]) {

    val rawTrainingData = spark.read
      .option("header", "true")
      .option("delimiter", ",")
      .option("inferSchema", "true")
      .load("src/main/resources/data/titanic/train.csv")

//    rawTrainingData.show()
  }
}

Upvotes: 1

Views: 2975

Answers (4)

Jake Henningsgaard

Reputation: 702

The problem turned out to be a conflict between Scala versions in my pom.xml, not my original code. The pom.xml pulled in dependencies built against several different Scala versions, which seems to have caused the issue. I set every dependency that uses Scala to the same version through a shared property, <scala.dep.version>2.11</scala.dep.version>, and that fixed the problem.
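
As a rough illustration of that kind of setup (the original pom.xml is not shown, so the artifact and the spark.version property below are assumptions), a single property can drive the Scala suffix of each Spark artifact:

```xml
<properties>
  <!-- One place to set the Scala binary version for all Scala-based dependencies -->
  <scala.dep.version>2.11</scala.dep.version>
</properties>

<dependencies>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <!-- Resolves to spark-sql_2.11 -->
    <artifactId>spark-sql_${scala.dep.version}</artifactId>
    <!-- spark.version is a hypothetical property, defined elsewhere in the pom -->
    <version>${spark.version}</version>
  </dependency>
</dependencies>
```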

Upvotes: 1

A srinivas

Reputation: 851

You have to add the spark-csv dependency JAR from Databricks to your pom. Older versions of Spark don't provide an API for reading CSV. Once you have added it, you can write something like this:

val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")      // Use first line of all files as header
  .option("inferSchema", "true") // Automatically infer data types
  .load("cars.csv")

Ref url: https://github.com/databricks/spark-csv/blob/master/README.md

Upvotes: 0

evan.oman

Reputation: 5572

It is expecting a Parquet file because Parquet is the default file format when none is specified.
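
The byte values in the error message can be decoded to see what Spark was looking for and what it found at the end of the file. This is only a small sketch; the object and method names are made up for illustration:

```scala
object MagicBytes {
  // Interprets each byte value as an ASCII character.
  def decode(bytes: Seq[Int]): String = bytes.map(_.toChar).mkString

  def main(args: Array[String]): Unit = {
    // Expected tail: the Parquet magic number.
    val expected = decode(Seq(80, 65, 82, 49))
    // Actual tail: a comma, the letter Q, then CR and LF (the end of a CSV row).
    val found = decode(Seq(44, 81, 13, 10))

    println(expected)   // PAR1
    println(found.trim) // ,Q
  }
}
```

So the reader was checking for the "PAR1" marker that closes a Parquet file and instead hit the last few characters of a CSV line.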

If you are using Spark < 2.0, you will need the spark-csv package. With Spark 2.0+, the built-in DataFrameReader can read CSV directly: call .csv(..fname..) instead of .load(..fname..).
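
For Spark 2.0+, a minimal sketch of the idea, assuming a local session. The object name, helper name, app name, and the tiny sample file are all hypothetical and only there to keep the example self-contained:

```scala
import java.nio.file.Files
import org.apache.spark.sql.{DataFrame, SparkSession}

object CsvFormatSketch {

  // Reads a CSV file with a header row, naming the format explicitly
  // so that load() does not fall back to the default data source.
  def readCsv(spark: SparkSession, path: String): DataFrame =
    spark.read
      .format("csv")
      .option("header", "true")
      .option("inferSchema", "true")
      .load(path)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("csvFormatSketch") // hypothetical app name
      .getOrCreate()

    // The data source load() uses when no format is given;
    // with default settings this should report parquet.
    println(spark.conf.get("spark.sql.sources.default"))

    // Hypothetical two-row sample so the sketch does not depend on the Kaggle file.
    val sample = Files.createTempFile("sample", ".csv")
    Files.write(sample, "PassengerId,Survived\n1,0\n2,1\n".getBytes("UTF-8"))

    readCsv(spark, sample.toString).printSchema()
    spark.stop()
  }
}
```

Calling .csv(path) on the reader is shorthand for .format("csv").load(path), so either form avoids the Parquet default.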

Upvotes: 0

user6022341

You're missing the input format. Use either:

val rawTrainingData = spark.read
  .option("header", "true")
  .option("delimiter", ",")
  .option("inferSchema", "true")
  .csv("src/main/resources/data/titanic/train.csv")

or

val rawTrainingData = spark.read
  .option("header", "true")
  .option("delimiter", ",")
  .option("inferSchema", "true")
  .format("csv")
  .load("src/main/resources/data/titanic/train.csv")

Upvotes: 3
