premon
premon

Reputation: 159

creating dataframe by loading csv file using scala in spark

but csv file is added with extra double quotes which results all cloumns into single column

there are four columns,header and 2 rows

"""SlNo"",""Name"",""Age"",""contact"""
"1,""Priya"",78,""Phone"""
"2,""Jhon"",20,""mail"""

val df = sqlContext.read.format("com.databricks.spark.csv").option("header","true").option("delimiter",",").option("inferSchema","true").load ("bank.csv") 
df: org.apache.spark.sql.DataFrame = ["SlNo","Name","Age","contact": string]

Upvotes: 0

Views: 1875

Answers (1)

Ramesh Maharjan
Ramesh Maharjan

Reputation: 41987

What you can do is read it using sparkContext and replace all " with empty and use zipWithIndex() to separate header and text data so that custom schema and row rdd data can be created. Finally just use the row rdd and schema in sqlContext's createDataFrame api

//reading text file, replacing and splitting and finally zipping with index
val rdd = sc.textFile("bank.csv").map(_.replaceAll("\"", "").split(",")).zipWithIndex()
//separating header to form schema
val header = rdd.filter(_._2 == 0).flatMap(_._1).collect()
val schema = StructType(header.map(StructField(_, StringType, true)))
//separating data to form row rdd
val rddData = rdd.filter(_._2 > 0).map(x => Row.fromSeq(x._1))
//creating the dataframe
sqlContext.createDataFrame(rddData, schema).show(false)

You should be getting

+----+-----+---+-------+
|SlNo|Name |Age|contact|
+----+-----+---+-------+
|1   |Priya|78 |Phone  |
|2   |Jhon |20 |mail   |
+----+-----+---+-------+

I hope the answer is helpful

Upvotes: 1

Related Questions