Spark - split a string column escaping the delimiter in one part

Question

I have a CSV file of two string columns (term, code). The code column has a special format [num]-[two_letters]-[text] where the text can also contain dashes -. I want to read this file using Spark into a dataframe of exactly four columns (term, num, two_letters, text).

Input
+---------------------------------+
|  term  |          code          |
+---------------------------------+
| term01 |    12-AB-some text     |
| term02 | 130-CD-some-other-text |
+---------------------------------+


Output
+------------------------------------------+
|  term  | num | letters |       text      |
+------------------------------------------+
| term01 | 12  |   AB    |   some text     |
| term02 | 130 |   CD    | some-other-text |
+------------------------------------------+

I can split the code column into three columns when there is no dash in its text part, but how can I achieve a solution that tackle all cases (something like get all text after exactly two dashes into one column)?

The code to split a column into three is clarified well in the answer here

akuiper · Accepted Answer

Here is one option with regexp_extract:

val df = Seq(("term01", "12-AB-some text"), ("term02", "130-CD-some-other-text")).toDF("term", "code")

// define the pattern that matches the string column
val p = "([0-9]+)-([a-zA-Z]{2})-(.*)"
// p: String = ([0-9]+)-([a-zA-Z]{2})-(.*)

// define the map from new column names to the group index in the pattern
val cols = Map("num" -> 1, "letters" -> 2, "text" -> 3)
// cols: scala.collection.immutable.Map[String,Int] = Map(num -> 1, letters -> 2, text -> 3)

// create the new columns on data frame
cols.foldLeft(df){ 
    case (df, (colName, groupIdx)) => df.withColumn(colName, regexp_extract($"code", p, groupIdx)) 
}.drop("code").show

+------+---+-------+---------------+
|  term|num|letters|           text|
+------+---+-------+---------------+
|term01| 12|     AB|      some text|
|term02|130|     CD|some-other-text|
+------+---+-------+---------------+

Spark - split a string column escaping the delimiter in one part

Answers (2)

Related Questions