Creating a DataFrame by loading a CSV file using Scala in Spark

Question

The CSV file contains extra double quotes, which causes all the columns to be read as a single column.

The file has four columns, with a header and 2 rows:

"""SlNo"",""Name"",""Age"",""contact"""
"1,""Priya"",78,""Phone"""
"2,""Jhon"",20,""mail"""

val df = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").option("delimiter", ",").option("inferSchema", "true").load("bank.csv")
df: org.apache.spark.sql.DataFrame = ["SlNo","Name","Age","contact": string]

Ramesh Maharjan · Accepted Answer

One option is to read the file with sparkContext, remove every " character, and use zipWithIndex() to separate the header from the data rows, so that a custom schema and a row RDD can be built. Then pass the row RDD and the schema to sqlContext's createDataFrame API.

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// read the text file, strip every double quote, split on commas, and zip each line with its index
val rdd = sc.textFile("bank.csv").map(_.replaceAll("\"", "").split(",")).zipWithIndex()
// the line with index 0 is the header; use it to build the schema (every column is a nullable string)
val header = rdd.filter(_._2 == 0).flatMap(_._1).collect()
val schema = StructType(header.map(StructField(_, StringType, true)))
// the remaining lines are data; turn each into a Row
val rddData = rdd.filter(_._2 > 0).map(x => Row.fromSeq(x._1))
// create the DataFrame from the row RDD and the schema
sqlContext.createDataFrame(rddData, schema).show(false)

You should get:

+----+-----+---+-------+
|SlNo|Name |Age|contact|
+----+-----+---+-------+
|1   |Priya|78 |Phone  |
|2   |Jhon |20 |mail   |
+----+-----+---+-------+
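
One caveat: stripping every " and then splitting on , also splits on commas that sit inside a quoted value. The sample looks like an ordinary CSV file in which each whole line was quoted a second time, so a rough alternative is to decode each line twice with a quote-aware splitter. This is only a sketch under that assumption; the names DoubleQuotedCsv, splitCsv and parseLine are made up for illustration and are not part of Spark or spark-csv, and the splitter only handles the simple doubled-quote escaping seen in the sample.

```scala
import scala.collection.mutable.ListBuffer

// Hypothetical helper for illustration; not part of Spark or spark-csv.
object DoubleQuotedCsv {

  // Minimal quote-aware splitter: commas inside "..." are kept,
  // and a doubled quote ("") inside a quoted field becomes a single quote.
  def splitCsv(line: String): List[String] = {
    val fields = ListBuffer.empty[String]
    val cur = new StringBuilder
    var inQuotes = false
    var i = 0
    while (i < line.length) {
      val c = line.charAt(i)
      if (inQuotes) {
        if (c == '"') {
          if (i + 1 < line.length && line.charAt(i + 1) == '"') {
            cur.append('"') // escaped quote
            i += 1
          } else {
            inQuotes = false // closing quote
          }
        } else {
          cur.append(c)
        }
      } else if (c == '"') {
        inQuotes = true // opening quote
      } else if (c == ',') {
        fields += cur.toString // field boundary
        cur.clear()
      } else {
        cur.append(c)
      }
      i += 1
    }
    fields += cur.toString
    fields.toList
  }

  // First pass unwraps the line-level quoting, second pass splits the real fields.
  def parseLine(line: String): List[String] =
    splitCsv(line).flatMap(splitCsv)

  def main(args: Array[String]): Unit = {
    // The raw line "1,""Priya"",78,""Phone""" from the question, escaped for Scala
    val raw = "\"1,\"\"Priya\"\",78,\"\"Phone\"\"\""
    println(splitCsv(raw).size) // 1: the whole line is a single quoted field
    println(parseLine(raw))     // List(1, Priya, 78, Phone)
  }
}
```

If the object is defined in the shell, something like sc.textFile("bank.csv").map(line => DoubleQuotedCsv.parseLine(line)) should be able to stand in for the replaceAll and split step, with the rest of the code unchanged. The first println also shows why the CSV reader in the question ends up with one column: each line is, as far as the parser can tell, a single quoted field.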

I hope the answer is helpful
