How to create a dataframe from Array[Strings]?

Question

I used rdd.collect() to create an Array, and now I want to use this Array[String] to create a DataFrame. My test file is in the following format (the data fields are separated by a pipe |).

TimeStamp
IdC
Name
FileName
Start-0f-fields
column01  
column02 
column03 
column04 
column05 
column06 
column07 
column08 
column010 
column11
End-of-fields
Start-of-data 
G0002B|0|13|IS|LS|Xys|Xyz|12|23|48|  
G0002A|0|13|IS|LS|Xys|Xyz|12|23|45|  
G0002x|0|13|IS|LS|Xys|Xyz|12|23|48|  
G0002C|0|13|IS|LS|Xys|Xyz|12|23|48|
End-of-data
document

The column names are listed between Start-of-fields and End-of-fields. I want to store the pipe-separated ("|") values in separate columns of a DataFrame.

Like the example below:

column01  column02 column03 column04 column05 column06 column07 column08 column010 column11
G0002C      0        13       IS       LS       Xys      Xyz     12        23         48
G0002x      0        13       LS       MS       Xys      Xyz     14        300        400

My code:

    val rdd = sc.textFile("the above text file")
    
    val columns = rdd.collect.slice(5,16).mkString(",") //  it will hold columnnames

    val data = rdd.collect.slice(5,16)
    val rdd1 = sc.parallelize(rdd.collect())
    val df = rdd1.toDf(columns)

But this does not give me the desired DataFrame shown above.

Insung Park · Accepted Answer

Could you try this?

import spark.implicits._ // needed for `toDS()` and `toDF()`

val rdd = sc.textFile("the above text file")
val lines = rdd.collect() // collect once and reuse the array

// Lines 5 to 14 (0-based) hold the column names; trim the trailing whitespace
val columns = lines.slice(5, 15).map(_.trim) // `.mkString(",")` is not needed

// Lines 17 to 20 (0-based) hold the data rows
val dataDS = lines.slice(17, 21)
  .map(_.trim)               // remove surrounding whitespace
  .map(_.stripSuffix("|"))   // remove the trailing pipe '|'
  .toSeq
  .toDS

val df = spark.read
  .option("header", false)
  .option("delimiter", "|")
  .csv(dataDS)
  .toDF(columns: _*)

df.show(false)

+--------+--------+--------+--------+--------+--------+--------+--------+---------+--------+
|column01|column02|column03|column04|column05|column06|column07|column08|column010|column11|
+--------+--------+--------+--------+--------+--------+--------+--------+---------+--------+
|G0002B  |0       |13      |IS      |LS      |Xys     |Xyz     |12      |23       |48      |
|G0002A  |0       |13      |IS      |LS      |Xys     |Xyz     |12      |23       |45      |
|G0002x  |0       |13      |IS      |LS      |Xys     |Xyz     |12      |23       |48      |
|G0002C  |0       |13      |IS      |LS      |Xys     |Xyz     |12      |23       |48      |
+--------+--------+--------+--------+--------+--------+--------+--------+---------+--------+
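The slice indices above depend on the exact line positions in the file. As a minimal sketch, assuming the marker lines always appear as in the sample, the column names and data rows could instead be located by searching for the markers. The names `MarkedFileParser`, `between`, `parse` and the shortened `sample` are illustrative and not part of the original answer; the code is plain Scala and does not need Spark.

```scala
object MarkedFileParser {
  // Returns the lines strictly between the first line matching `start`
  // and the next line matching `end`.
  def between(lines: Seq[String],
              start: String => Boolean,
              end: String => Boolean): Seq[String] =
    lines.dropWhile(l => !start(l)).drop(1).takeWhile(l => !end(l))

  // Returns (column names, rows split on '|').
  def parse(raw: Seq[String]): (Seq[String], Seq[Seq[String]]) = {
    val lines = raw.map(_.trim)
    // The sample file spells the first marker "Start-0f-fields" (with a zero),
    // so the pattern accepts both "of" and "0f".
    val columns = between(lines,
      _.matches("(?i)start-[o0]f-fields"),
      _.equalsIgnoreCase("End-of-fields"))
    val rows = between(lines,
      _.equalsIgnoreCase("Start-of-data"),
      _.equalsIgnoreCase("End-of-data"))
      .map(_.stripSuffix("|").split("\\|", -1).toSeq) // drop trailing pipe, then split
    (columns, rows)
  }

  // Shortened, hypothetical version of the file from the question.
  val sample: Seq[String] = Seq(
    "TimeStamp", "Start-0f-fields", "column01  ", "column02 ", "column03",
    "End-of-fields", "Start-of-data ", "G0002B|0|13|  ", "G0002A|0|13|",
    "End-of-data", "document")
}
```

With `val (columns, rows) = MarkedFileParser.parse(rdd.collect())`, the `columns` value could be passed to `toDF(columns: _*)` as above, and the rows joined back with `mkString("|")` before calling `toDS`.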

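Calling spark.read...csv() without a schema makes Spark inspect the data to work out the columns, and with the inferSchema option enabled it scans the input to infer the types, which can take a long time on huge data (see, e.g., Additional reading).

In that case, you can specify the schema as below.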

/*
  column01 STRING,
  column02 STRING,
  column03 STRING,

  ...
*/
val schema = columns
  .map(c => s"$c STRING")
  .mkString(",\n")

val df = spark.read
  .option("header", false)
  .option("delimiter", "|")
  .schema(schema)  // explicit schema, so no schema inference is needed
  .csv(dataDS)
// .toDF(columns: _*) is unnecessary when the schema is specified
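If going through the CSV reader is not required, another option is to build the DataFrame directly from the arrays. The following is only a sketch under stated assumptions: every column is treated as a string, each data line has exactly as many fields as there are column names, and the helper name `toDataFrame` as well as the local SparkSession in `main` are hypothetical, added for illustration.

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

object ArrayToDataFrame {
  // Builds a DataFrame from column names and pipe-separated data lines.
  // Assumption: all columns are strings and every line has columns.length fields.
  def toDataFrame(spark: SparkSession,
                  columns: Array[String],
                  data: Array[String]): DataFrame = {
    val schema = StructType(
      columns.map(c => StructField(c.trim, StringType, nullable = true)))

    val rows = data.map { line =>
      // Drop the trailing pipe, then split on '|' keeping empty fields
      Row.fromSeq(line.trim.stripSuffix("|").split("\\|", -1).toSeq)
    }

    spark.createDataFrame(spark.sparkContext.parallelize(rows.toSeq), schema)
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only; in spark-shell, `spark` already exists.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("array-to-df")
      .getOrCreate()

    val columns = Array("column01", "column02", "column03")
    val data = Array("G0002B|0|13|", "G0002A|0|13|")

    toDataFrame(spark, columns, data).show(false)
    spark.stop()
  }
}
```

Because the schema is given explicitly, Spark does not need to inspect the data to determine the columns. A line with a different number of fields would only fail once the DataFrame is evaluated, so validating the field count beforehand may be worthwhile.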
