Reputation: 567
I am trying to create a data frame in Spark 1.6.0. I used this command to create it:
val df = sqlContext.read.format("com.databricks.spark.csv")
.option("header","true")
.option("delimiter",",")
.option("inferSchema","true")
.load("/user/rohitchopra32_gmail/Project1_dataset_bank-full(2).csv")
It created a data frame, but when I call df.show()
the output is incomplete and badly formatted.
Also, when I try to select a column using
val selectedData = df.select("age")
I get an error.
Link to my data set: data set
I am new to Spark and I don't know what causes this error. Am I missing something?
Upvotes: 2
Views: 1906
Reputation: 40380
Like I said in the comment, your CSV file is not well formatted, so let's rewrite it and parse it:
scala> sc.textFile(filePath).map(x => x.replaceAll("\"", "")).saveAsTextFile("./Downloads/clean_data")
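To see what that map does to a single record, here is a minimal plain-Scala sketch that needs no Spark. The sample line is hypothetical (the exact quoting in the real file may differ), and the names `CleanLineSketch` and `stripQuotes` are made up for illustration.

```scala
object CleanLineSketch {
  // Same transformation as in the map above: drop every double-quote character.
  def stripQuotes(line: String): String = line.replaceAll("\"", "")

  def main(args: Array[String]): Unit = {
    // Hypothetical raw record: the whole line is wrapped in quotes and the
    // inner quotes are doubled, so a CSV parser sees one big quoted field.
    val raw = "\"53;\"\"unknown\"\";\"\"married\"\"\""

    val cleaned = stripQuotes(raw)
    val fields = cleaned.split(";")

    println(cleaned)       // 53;unknown;married
    println(fields.length) // 3
  }
}
```

Note that this removes every double quote, not only the ones at the end of a line, so it is only safe if no field legitimately contains a quote or the delimiter.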
Now that we have removed the double quotes that were causing trouble, we can load the CSV with the code you already have. Note that the delimiter is ";" rather than ",":
scala> sqlContext.read.format("com.databricks.spark.csv").option("header","true").option("delimiter",";").option("inferSchema","true").load("./Downloads/clean_data").show
+---+-------------+--------+---------+-------+-------+-------+----+--------+---+-----+--------+--------+-----+--------+--------+---+
|age|          job| marital|education|default|balance|housing|loan| contact|day|month|duration|campaign|pdays|previous|poutcome|  y|
+---+-------------+--------+---------+-------+-------+-------+----+--------+---+-----+--------+--------+-----+--------+--------+---+
| 53|      unknown| married|  unknown|     no|      0|     no|  no|cellular| 25|  aug|     209|       5|   -1|       0| unknown| no|
| 51|   technician| married| tertiary|     no|     -3|     no|  no|cellular| 25|  aug|      91|       9|   -1|       0| unknown| no|
| 33|   technician|  single|secondary|     no|    -32|     no|  no|cellular| 25|  aug|     196|      12|   -1|       0| unknown| no|
| 48|   management|divorced| tertiary|     no|      0|     no|  no|cellular| 25|  aug|     110|       3|   -1|       0| unknown| no|
| 60|      retired| married|  primary|     no|    155|     no|  no|cellular| 25|  aug|     115|       7|   -1|       0| unknown| no|
| 50|   management|divorced| tertiary|     no|      0|     no|  no|cellular| 25|  aug|      57|       3|   -1|       0| unknown| no|
| 59|  blue-collar| married|  primary|     no|   6271|    yes|  no|cellular| 25|  aug|     102|       5|   -1|       0| unknown| no|
| 33|   technician|  single| tertiary|     no|    137|     no|  no|cellular| 25|  aug|      88|       4|   -1|       0| unknown| no|
| 37|self-employed| married|secondary|     no|    119|     no|  no|cellular| 25|  aug|      68|       4|   -1|       0| unknown| no|
| 45|  blue-collar| married|  primary|     no|    185|     no|  no|cellular| 25|  aug|      78|       4|   -1|       0| unknown| no|
| 47|   management| married|secondary|     no|   1083|     no|  no|cellular| 25|  aug|     141|       4|   -1|       0| unknown| no|
| 41|   technician| married|secondary|     no|   2039|     no|  no|cellular| 25|  aug|     160|       4|   -1|       0| unknown| no|
| 52|   management| married| tertiary|     no|    967|     no|  no|cellular| 25|  aug|     472|      10|   -1|       0| unknown| no|
| 35|   technician|  single| tertiary|     no|    275|    yes|  no|cellular| 25|  aug|      63|       5|   -1|       0| unknown| no|
| 34|   technician| married|secondary|     no|     47|     no|  no|cellular| 25|  aug|     132|       6|   -1|       0| unknown| no|
| 36|   management| married| tertiary|     no|   1235|     no|  no|cellular| 25|  aug|      85|       6|   -1|       0| unknown| no|
| 32|   technician| married|secondary|    yes|      4|    yes| yes|cellular| 25|  aug|     132|       8|   -1|       0| unknown| no|
| 36|   management| married| tertiary|     no|   3874|     no|  no|cellular| 25|  aug|     425|       6|   -1|       0| unknown| no|
| 58|  blue-collar| married|  unknown|     no|      9|     no|  no|cellular| 25|  aug|      50|      23|   -1|       0| unknown| no|
| 43|   technician| married|secondary|     no|    136|     no|  no|cellular| 25|  aug|     363|       7|   -1|       0| unknown|yes|
+---+-------------+--------+---------+-------+-------+-------+----+--------+---+-----+--------+--------+-----+--------+--------+---+
only showing top 20 rows
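With the header parsed into separate columns, selecting by column name should work as well. The following is a sketch rather than something run against your file: it assumes a spark-shell session on Spark 1.6 with the spark-csv package on the classpath (so `sqlContext` already exists) and that the cleaned output was written to `./Downloads/clean_data` as above.

```scala
// Assumes spark-shell (Spark 1.6) with spark-csv available; sqlContext is provided by the shell.
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")      // first line holds the column names
  .option("delimiter", ";")      // the file is semicolon-separated, not comma-separated
  .option("inferSchema", "true") // let spark-csv guess numeric vs. string columns
  .load("./Downloads/clean_data")

// Check that the header was split into individual columns before selecting.
df.printSchema()

// Selecting by name only works once "age" exists as its own column.
val selectedData = df.select("age")
selectedData.show(5)
```

In spark-shell, paste a multi-line chain like this with `:paste`, or put it on a single line. If `printSchema()` shows one long column name instead of separate columns, the delimiter or the quoting is still wrong, and `select("age")` will fail because no column with that name exists.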
Upvotes: 2