Reputation: 702
I'm attempting the Kaggle Titanic example using Spark ML and Scala. When I try to load the training file, I run into a strange error:
java.io.IOException: Could not read footer: java.lang.RuntimeException: file:/Users/jake/Development/titanicExample/src/main/resources/data/titanic/train.csv is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [44, 81, 13, 10]
The file is a .csv, so I'm not sure why it's expecting a Parquet file.
Here is my code:
object App {

  val spark = SparkSession
    .builder()
    .master("local[*]")
    .appName("liveOrDie")
    .getOrCreate()

  def main(args: Array[String]) {
    val rawTrainingData = spark.read
      .option("header", "true")
      .option("delimiter", ",")
      .option("inferSchema", "true")
      .load("src/main/resources/data/titanic/train.csv")

    // rawTrainingData.show()
  }
}
Upvotes: 1
Views: 2975
Reputation: 702
It turns out the problem was NOT in my original code: I had a conflict between Scala versions in my pom.xml. Multiple dependencies were pulling in different Scala versions, which seemingly caused the issue. I updated all the Scala-based dependencies to the same version using a dynamic property <scala.dep.version>2.11</scala.dep.version>, and that fixed the problem.
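For illustration, the aligned entries might look something like this (a sketch only; the specific artifacts and versions are assumptions, not the asker's actual pom.xml):

```xml
<properties>
  <scala.dep.version>2.11</scala.dep.version>
</properties>

<dependencies>
  <!-- Every Scala-based artifact uses the same Scala binary version
       via the shared property, so no two Scala versions can conflict -->
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_${scala.dep.version}</artifactId>
    <version>2.0.0</version>
  </dependency>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_${scala.dep.version}</artifactId>
    <version>2.0.0</version>
  </dependency>
</dependencies>
```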
Upvotes: 1
Reputation: 851
You have to add a dependency jar from Databricks to your pom.xml. Lower versions of Spark don't provide an API to read CSV files. Once you add the dependency, you can write something like this:
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")      // use the first line of all files as the header
  .option("inferSchema", "true") // automatically infer data types
  .load("cars.csv")
Ref url: https://github.com/databricks/spark-csv/blob/master/README.md
Upvotes: 0
Reputation: 5572
It is expecting a Parquet file because that is the default file format.
If you are using Spark < 2.0, you will need to use spark-csv. If you are using Spark 2.0+, you can use the DataFrameReader directly by calling .csv(..fname..) instead of .load(..fname..).
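As a minimal sketch of the Spark 2.0+ approach (the path and app name are illustrative):

```scala
import org.apache.spark.sql.SparkSession

object CsvExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .master("local[*]")
      .appName("csvExample")
      .getOrCreate()

    // .csv(...) names the input format explicitly, so the reader
    // no longer falls back to its default (Parquet)
    val df = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("src/main/resources/data/titanic/train.csv")

    df.printSchema()
  }
}
```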
Upvotes: 0
Reputation:
You're missing the input format. Either:
val rawTrainingData = spark.read
.option("header", "true")
.option("delimiter", ",")
.option("inferSchema", "true")
.csv("src/main/resources/data/titanic/train.csv")
or
val rawTrainingData = spark.read
.option("header", "true")
.option("delimiter", ",")
.option("inferSchema", "true")
.format("csv")
.load("src/main/resources/data/titanic/train.csv")
Upvotes: 3