user1342645

Reputation: 655

Spark Scala issue uploading CSV

I am trying to load a CSV file into a temp table so that I can query it, and I am running into two issues. First, I tried loading the CSV into a DataFrame (the CSV has some empty fields) and couldn't find a way to make it work. I found a suggestion in another post to use:

val df = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").load("cars.csv")

but it gives me an error saying "Failed to load class for data source: com.databricks.spark.csv"

Then I uploaded the file and read it as a text file, without the header row:

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._

case class cars(id: Int, name: String, licence: String)

val carsDF = sc.textFile("../myTests/cars.csv")
  .map(_.split(","))
  .map(p => cars(p(0).trim.toInt, p(1).trim, p(2).trim))
  .toDF()

carsDF.registerTempTable("cars")
val dgp = sqlContext.sql("SELECT * FROM cars")
dgp.show()

This gives an error because one of the licence fields is empty. I tried to handle the issue when building the DataFrame, but it did not work. I could obviously open the CSV file and fix it by filling in the empty values, but I do not want to do that: there are a lot of fields, so it would be problematic. I want to fix it programmatically, either when I create the DataFrame or in the case class.
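For context, the failure most likely comes from String.split dropping trailing empty strings by default (a minimal illustration; the sample row is made up):

// split(",") drops trailing empty strings, so the empty licence field disappears
"3,Audi,".split(",").length     // 2 -- p(2) throws ArrayIndexOutOfBoundsException
"3,Audi,".split(",", -1).length // 3 -- a limit of -1 keeps the empty field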

If you have any other thoughts, please let me know as well.

Upvotes: 0

Views: 536

Answers (2)

Abu Shoeb

Reputation: 5152

Here you go. Remember to check the delimiter for your CSV.

// create a Spark session (requires Spark 2.x)
val spark = org.apache.spark.sql.SparkSession.builder
  .master("local")
  .appName("Spark CSV Reader")
  .getOrCreate()

// read the csv
val df = spark.read
  .format("csv")
  .option("header", "true")        // read the first line as column names
  .option("mode", "DROPMALFORMED") // drop rows that do not parse
  .option("delimiter", ",")
  .load("/your/csv/dir/simplecsv.csv")

// create a table from the dataframe
df.createOrReplaceTempView("tableName")
// run your sql query
val sqlResults = spark.sql("SELECT * FROM tableName")
// display the results (display(sqlResults) works in a Databricks notebook;
// use show() in a plain spark-shell)
sqlResults.show()
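Note that DROPMALFORMED only removes rows that fail to parse; a row whose licence field is merely empty is kept, with the value read as null. If you want to fill those nulls instead of editing the file, a sketch (assuming the column is named licence; the placeholder value is arbitrary):

// replace null licence values with a placeholder instead of editing the file
val cleaned = df.na.fill("unknown", Seq("licence"))
cleaned.createOrReplaceTempView("tableName")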

Upvotes: 0

zero323

Reputation: 330083

To be able to use spark-csv you have to make sure it is available. In interactive mode the simplest solution is to use the --packages argument when you start the shell:

bin/spark-shell --packages com.databricks:spark-csv_2.10:1.1.0
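With the package on the classpath, the read from your question should work, and an empty licence field no longer breaks the load. A quick sketch of the end-to-end flow (assuming the header row names the columns id, name, licence as in your case class; whether empty fields come back as null or as empty strings depends on the spark-csv version):

// load the csv through the spark-csv data source and query it as a temp table
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .load("cars.csv")

df.registerTempTable("cars")
sqlContext.sql("SELECT name, licence FROM cars").show()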

Regarding manual parsing: working with CSV files, especially malformed ones like cars.csv, requires much more work than simply splitting on commas. Some things to consider:

  • how to detect the csv dialect, including the method of string quoting
  • how to handle quotes and newline characters inside strings
  • how to handle malformed lines

In the case of the example file you have to at least (see the sketch after this list):

  • filter empty lines
  • read the header
  • map lines to fields, providing a default value if a field is missing
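A minimal sketch of those three steps, reusing the cars case class and imports from your question (assumes a single header line and no quoted fields):

val raw = sc.textFile("../myTests/cars.csv")
val header = raw.first()

val carsDF = raw
  .filter(line => line != header && line.trim.nonEmpty) // drop header and empty lines
  .map(_.split(",", -1)) // limit -1 keeps trailing empty fields
  .map { p =>
    // fall back to a default when a field is missing or blank
    def field(i: Int, default: String) =
      if (i < p.length && p(i).trim.nonEmpty) p(i).trim else default
    cars(field(0, "0").toInt, field(1, ""), field(2, ""))
  }
  .toDF()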

Upvotes: 1
