
Reputation: 65

Single-column delimited string RDD to correctly columned DataFrame

I have an RDD with a single column. Each row is a string representing a list of entries delimited by |. For example:

  col_1
 a|b|c|d
 q|w|e|r

I want to convert this to a DataFrame, so that it looks like this:

col_1 | col_2 | col_3 | col_4
 a        b       c        d
 q        w       e        r

The number of columns is unknown, and no headers are needed (they can just be default column names).

I have tried:

.map(i => i.split("|")).toDF()

However, this just returns a single column that is an array of values, instead of actually splitting into columns. The end goal of this is to write it to a parquet file.

One solution is to write it to a text file, then read it back with Spark as a CSV using my delimiter, then write it to a parquet file. But this is a terrible approach; there has to be a better way to do this.
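For reference, that workaround would look roughly like this (the paths here are placeholders), which makes the extra disk round-trip obvious:

// write the raw strings out, only to immediately read them back
rdd.saveAsTextFile("/tmp/intermediate")

// re-read the text as CSV, treating | as the separator
val df = spark.read
  .option("delimiter", "|")
  .csv("/tmp/intermediate")

// finally write the actual output
df.write.parquet("/tmp/output.parquet")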

Upvotes: 2

Views: 878

Answers (1)

Tzach Zohar

Reputation: 37822

A DataFrame must have a predefined schema, so you'll have to supply the number of columns in some way. If different records might have a different number of delimiters, you'd have to scan the data twice (once to determine the number of columns, then again to convert to a DataFrame); otherwise, "peeking" at the first record might suffice:

import spark.implicits._

// note the escaping: | is a special character in regular expressions
val arrays = rdd.map(_.split("\\|"))

// if records may have a different number of delimiters, scan for the maximum:
val maxCols = arrays.map(_.length).max()

// otherwise, the first record is enough to determine the number of columns:
val maxCols = arrays.first().length

// create one column per index in (0 until maxCols) and select them all:
val result = arrays.toDF("arr")
  .select((0 until maxCols).map(i => $"arr"(i).as(s"col_$i")): _*)

result.show()
+-----+-----+-----+-----+
|col_0|col_1|col_2|col_3|
+-----+-----+-----+-----+
|    a|    b|    c|    d|
|    q|    w|    e|    r|
+-----+-----+-----+-----+
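And since the stated end goal is a parquet file, the result can then be written out directly (the path is a placeholder):

result.write.parquet("/path/to/output")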

Upvotes: 3
