Reputation: 817
I have a text file delimited with ~, and I need to perform some parsing before converting it to a DataFrame. The code reads the text file in as an RDD[String], does some parsing, converts it to an RDD[Row], and then creates a DataFrame from the schema.
Below is the code I currently have. It works, but the problem is that the actual schema is 400 fields long. I was wondering if there is a simpler way than typing attributes(1), attributes(2), attributes(3)... and so on.
I am currently on Spark 1.6 (CDH 5.2.2).
Example Input:
20161481132310 ~ ~"This" is a comma 10
Current Code:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val schema_1 = StructType(Array(
  StructField("EXAMPLE_1", StringType, true),
  StructField("EXAMPLE_2", StringType, true),
  StructField("EXAMPLE_3", StringType, true)))

// Read the raw file and trim the whitespace around each ~-delimited field
val rdd = sc.textFile("example.txt")
val rdd_truncate = rdd.map(_.split("~").map(_.trim).mkString("~"))

// Split again and build each Row field by field; this is the part that
// does not scale to 400 columns
val row_final = rdd_truncate
  .map(_.split("~"))
  .map(attributes => Row(attributes(0),
    attributes(1),
    attributes(2)))

val df = sqlContext.createDataFrame(row_final, schema_1)
Based on the suggestion, I modified my code to the following. It works except on quotes: the "This" in the example input will fail. Any suggestions?
import org.apache.spark.sql.functions.{col, trim}

val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("delimiter", "~")
  .schema(schema)
  .load("example.txt")

// Trim whitespace from every column
val df_final = df.select(df.columns.map(c => trim(col(c)).alias(c)): _*)
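One possible workaround for the quote issue, since spark-csv treats " as the quote character by default, is to override the quote option with a character that never appears in the data, which effectively disables quote handling. A sketch, untested, assuming \u0000 does not occur in the file:

// Sketch: point the quote option at \u0000 (assumed absent from the data)
// so embedded quotes such as "This" are read as plain text
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("delimiter", "~")
  .option("quote", "\u0000")
  .schema(schema)
  .load("example.txt")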
Upvotes: 0
Views: 292
Reputation: 330063
Just use the standard CSV reader:
spark.read.schema(schema).option("delimiter", "~").csv("example.txt")
If you want to trim fields, just use select:
import org.apache.spark.sql.functions.{col, trim}
df.select(df.columns.map(c => trim(col(c)).alias(c)): _*)
If you use Spark 1.x you can use spark-csv:
sqlContext.read
.format("com.databricks.spark.csv")
.schema(schema)
.option("delimiter", "~")
.load("example.txt")
If this is for some reason not sufficient you can use Row.fromSeq:
Row.fromSeq(line.split("~").take(3))
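Putting this together for the 400-field case, both the schema and the rows can be built programmatically, so nothing has to be typed out by hand. A minimal sketch, assuming hypothetical EXAMPLE_i column names (substitute your real field list):

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Hypothetical names; in practice load the real 400 field names from somewhere
val schema = StructType((1 to 400).map(i => StructField(s"EXAMPLE_$i", StringType, true)))

val rows = sc.textFile("example.txt")
  .map(_.split("~", -1).map(_.trim))   // limit -1 keeps trailing empty fields
  .map(fields => Row.fromSeq(fields))

val df = sqlContext.createDataFrame(rows, schema)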
Upvotes: 3