Rohit Chopra

Reputation: 567

Spark Dataframe - cannot resolve ... given

I am trying to create a data frame in Spark 1.6.0. I used this command to create it:

val df = sqlContext.read.format("com.databricks.spark.csv")
  .option("header","true")
  .option("delimiter",",")
  .option("inferSchema","true")
  .load("/user/rohitchopra32_gmail/Project1_dataset_bank-full(2).csv")

It creates a data frame, but when I call df.show() the output is incomplete and unformatted. When I then try to select a column with val selectedData = df.select("age"), I get a "cannot resolve" error.

Link to my data set: data set

I am new to Spark and I don't know what causes this error. Am I missing something?

Upvotes: 2

Views: 1906

Answers (1)

eliasah

Reputation: 40380

As I said in the comments, your CSV file is not well formed, so let's rewrite it and then parse it:

scala> sc.textFile(filePath).map(x => x.replaceAll("\"", "")).saveAsTextFile("./Downloads/clean_data")

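Now that the double quotes that were causing trouble are gone, we can load the CSV with essentially the same line of code you used. Note that the delimiter is set to ";" here rather than ",", because the fields in this file are separated by semicolons: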

scala> sqlContext.read.format("com.databricks.spark.csv").option("header","true").option("delimiter",";").option("inferSchema","true").load("./Downloads/clean_data").show
+---+-------------+--------+---------+-------+-------+-------+----+--------+---+-----+--------+--------+-----+--------+--------+---+
|age|          job| marital|education|default|balance|housing|loan| contact|day|month|duration|campaign|pdays|previous|poutcome|  y|
+---+-------------+--------+---------+-------+-------+-------+----+--------+---+-----+--------+--------+-----+--------+--------+---+
| 53|      unknown| married|  unknown|     no|      0|     no|  no|cellular| 25|  aug|     209|       5|   -1|       0| unknown| no|
| 51|   technician| married| tertiary|     no|     -3|     no|  no|cellular| 25|  aug|      91|       9|   -1|       0| unknown| no|
| 33|   technician|  single|secondary|     no|    -32|     no|  no|cellular| 25|  aug|     196|      12|   -1|       0| unknown| no|
| 48|   management|divorced| tertiary|     no|      0|     no|  no|cellular| 25|  aug|     110|       3|   -1|       0| unknown| no|
| 60|      retired| married|  primary|     no|    155|     no|  no|cellular| 25|  aug|     115|       7|   -1|       0| unknown| no|
| 50|   management|divorced| tertiary|     no|      0|     no|  no|cellular| 25|  aug|      57|       3|   -1|       0| unknown| no|
| 59|  blue-collar| married|  primary|     no|   6271|    yes|  no|cellular| 25|  aug|     102|       5|   -1|       0| unknown| no|
| 33|   technician|  single| tertiary|     no|    137|     no|  no|cellular| 25|  aug|      88|       4|   -1|       0| unknown| no|
| 37|self-employed| married|secondary|     no|    119|     no|  no|cellular| 25|  aug|      68|       4|   -1|       0| unknown| no|
| 45|  blue-collar| married|  primary|     no|    185|     no|  no|cellular| 25|  aug|      78|       4|   -1|       0| unknown| no|
| 47|   management| married|secondary|     no|   1083|     no|  no|cellular| 25|  aug|     141|       4|   -1|       0| unknown| no|
| 41|   technician| married|secondary|     no|   2039|     no|  no|cellular| 25|  aug|     160|       4|   -1|       0| unknown| no|
| 52|   management| married| tertiary|     no|    967|     no|  no|cellular| 25|  aug|     472|      10|   -1|       0| unknown| no|
| 35|   technician|  single| tertiary|     no|    275|    yes|  no|cellular| 25|  aug|      63|       5|   -1|       0| unknown| no|
| 34|   technician| married|secondary|     no|     47|     no|  no|cellular| 25|  aug|     132|       6|   -1|       0| unknown| no|
| 36|   management| married| tertiary|     no|   1235|     no|  no|cellular| 25|  aug|      85|       6|   -1|       0| unknown| no|
| 32|   technician| married|secondary|    yes|      4|    yes| yes|cellular| 25|  aug|     132|       8|   -1|       0| unknown| no|
| 36|   management| married| tertiary|     no|   3874|     no|  no|cellular| 25|  aug|     425|       6|   -1|       0| unknown| no|
| 58|  blue-collar| married|  unknown|     no|      9|     no|  no|cellular| 25|  aug|      50|      23|   -1|       0| unknown| no|
| 43|   technician| married|secondary|     no|    136|     no|  no|cellular| 25|  aug|     363|       7|   -1|       0| unknown|yes|
+---+-------------+--------+---------+-------+-------+-------+----+--------+---+-----+--------+--------+-----+--------+--------+---+
only showing top 20 rows
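
To illustrate why df.select("age") could not resolve the column in the original load, here is a minimal plain-Scala sketch (no Spark needed). The header line below is hypothetical: I am assuming the raw file quotes its fields and separates them with ";", which may not match your file exactly.

```scala
// Hypothetical header line (an assumption about the raw file's shape):
// every field is wrapped in double quotes and fields are separated by ';'.
val rawHeader = "\"age\";\"job\";\"marital\""

// Splitting on ',' (the delimiter used in the question) finds no separator,
// so the whole line ends up as one single column name.
val byComma = rawHeader.split(",")

// Same cleanup as in the answer: drop every double quote, then split on ';'.
val cleaned = rawHeader.replaceAll("\"", "")
val bySemicolon = cleaned.split(";")

println(byComma.length)             // 1
println(bySemicolon.mkString(", ")) // age, job, marital
```

With the comma delimiter there is only one column, and it is not named age, so a select on "age" has nothing to resolve against. After the cleanup and with ";" as the delimiter, age becomes a column of its own.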

Upvotes: 2
