Glarixon

Reputation: 59

How do I split a column by using delimiters from another column in Spark/Scala

I have another question that is related to the split function. I am new to Spark/Scala.

Below is the sample DataFrame:


+-------------------+---------+
|             VALUES|Delimiter|
+-------------------+---------+
|       50000.0#0#0#|        #|
|          0@1000.0@|        @|
|                 1$|        $|
|1000.00^Test_string|        ^|
+-------------------+---------+

and I want the output to be:

+-------------------+---------+----------------------+
|VALUES             |Delimiter|split_values          |
+-------------------+---------+----------------------+
|50000.0#0#0#       |#        |[50000.0, 0, 0, ]     |
|0@1000.0@          |@        |[0, 1000.0, ]         |
|1$                 |$        |[1, ]                 |
|1000.00^Test_string|^        |[1000.00, Test_string]|
+-------------------+---------+----------------------+

I tried to split this manually:

dept.select(split(col("VALUES"), "#|@|\\$|\\^")).show()

and the output is:

+-----------------------+
|split(VALUES,#|@|\$|\^)|
+-----------------------+
|      [50000.0, 0, 0, ]|
|          [0, 1000.0, ]|
|                  [1, ]|
|   [1000.00, Test_st...|
+-----------------------+


But I want to pick up the delimiter automatically from the Delimiter column for a large dataset.

Upvotes: 5

Views: 2309

Answers (5)

Manoj Kumar

Reputation: 46

Enhancing @dsk's answer to handle two more cases.


Scala Code

First case: to handle the special characters ('$', '^'), we need to prefix the Delimiter column with a backslash escape ('\\'), like this:

import org.apache.spark.sql.functions.expr
val new_deptArray = dept.withColumn("split_values", expr("""split(VALUES, concat('\\', Delimiter))"""))
new_deptArray.show(truncate = false)

And this will be the output:

+-------------------+---------+----------------------+
|VALUES             |Delimiter|split_values          |
+-------------------+---------+----------------------+
|50000.0#0#0#       |#        |[50000.0, 0, 0, ]     |
|0@1000.0@          |@        |[0, 1000.0, ]         |
|1$                 |$        |[1, ]                 |
|1000.00^Test_string|^        |[1000.00, Test_string]|
+-------------------+---------+----------------------+

Second case: notice that a few rows end up with an empty element at the end, which ideally we don't want. Since the new split_values column is of type Array<String>, we can simply use array_remove to drop those empty elements.

import org.apache.spark.sql.functions.{expr, array_remove}
val new_deptArray = dept.withColumn("split_values", array_remove(expr("""split(VALUES, concat('\\', Delimiter))"""), ""))
new_deptArray.show(truncate = false)

It will give this output:

+-------------------+---------+----------------------+
|VALUES             |Delimiter|split_values          |
+-------------------+---------+----------------------+
|50000.0#0#0#       |#        |[50000.0, 0, 0]       |
|0@1000.0@          |@        |[0, 1000.0]           |
|1$                 |$        |[1]                   |
|1000.00^Test_string|^        |[1000.00, Test_string]|
+-------------------+---------+----------------------+

PySpark Code

It works almost the same in PySpark; the only difference is that we need four backslashes ('\\\\'), unlike Scala where two backslashes were sufficient.

from pyspark.sql.functions import expr, array_remove
new_deptArray = dept.withColumn("split_values",array_remove(expr("""split(VALUES, concat("\\\\",Delimiter))"""), ""))
new_deptArray.show(truncate = False)

+-------------------+---------+----------------------+
|VALUES             |Delimiter|split_values          |
+-------------------+---------+----------------------+
|50000.0#0#0#       |#        |[50000.0, 0, 0]       |
|0@1000.0@          |@        |[0, 1000.0]           |
|1$                 |$        |[1]                   |
|1000.00^Test_string|^        |[1000.00, Test_string]|
+-------------------+---------+----------------------+

Thanks.

Upvotes: 1

Apollo

Reputation: 41

This is my latest solution:

import java.util.regex.Pattern
import org.apache.spark.sql.functions.{col, udf}

val split_udf = udf((value: String, delimiter: String) => value.split(Pattern.quote(delimiter), -1))
val solution = dept.withColumn("split_values", split_udf(col("VALUES"), col("Delimiter")))
solution.show(truncate = false)

It escapes the special characters in the Delimiter column (Pattern.quote treats the delimiter literally rather than as a regex). The other answers do not work for

("50000.0\\0\\0\\", "\\")

and linusRian's answer requires adding the special characters manually.
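As a quick check, here is a minimal sketch of that backslash case run through the UDF above; the SparkSession setup and the single-row sample DataFrame are my own scaffolding for illustration, not part of the original answer.

import java.util.regex.Pattern
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

val spark = SparkSession.builder.getOrCreate()
import spark.implicits._

// One row whose delimiter is a backslash; Pattern.quote treats it as a
// literal character, so no manual escaping is needed.
val sample = Seq(("50000.0\\0\\0\\", "\\")).toDF("VALUES", "Delimiter")

val split_udf = udf((value: String, delimiter: String) =>
  value.split(Pattern.quote(delimiter), -1))

sample
  .withColumn("split_values", split_udf(col("VALUES"), col("Delimiter")))
  .show(truncate = false)
// split_values: [50000.0, 0, 0, ]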

Upvotes: 1

Pubudu Sitinamaluwa

Reputation: 978

EDIT: Please check the bottom of the answer for the Scala version.

You can use a custom user-defined function (pyspark.sql.functions.udf) to achieve this.

from typing import List

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType, ArrayType


def split_col(value: str, delimiter: str) -> List[str]:
    return str(value).split(str(delimiter))


udf_split = udf(lambda x, y: split_col(x, y), ArrayType(StringType()))

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([
    ('50000.0#0#0#', '#'), ('0@1000.0@', '@'), ('1$', '$'), ('1000.00^Test_string', '^')
], schema='VALUES String, Delimiter String')

df = df.withColumn("split_values", udf_split(df['VALUES'], df['Delimiter']))

df.show(truncate=False)

Output

+-------------------+---------+----------------------+
|VALUES             |Delimiter|split_values          |
+-------------------+---------+----------------------+
|50000.0#0#0#       |#        |[50000.0, 0, 0, ]     |
|0@1000.0@          |@        |[0, 1000.0, ]         |
|1$                 |$        |[1, ]                 |
|1000.00^Test_string|^        |[1000.00, Test_string]|
+-------------------+---------+----------------------+

Note that the split_values column contains a list of strings. You can also update the split_col function to apply further changes to the values (a sketch of the analogous change on the Scala side follows the Scala version below).

EDIT: Scala version

import org.apache.spark.sql.functions.udf

import spark.implicits._

val data = Seq(("50000.0#0#0#", "#"), ("0@1000.0@", "@"), ("1$", "$"), ("1000.00^Test_string", "^"))
var df = data.toDF("VALUES", "Delimiter")

val udf_split_col = udf { (x: String, y: String) => x.split(y) }

df = df.withColumn("split_values", udf_split_col(df.col("VALUES"), df.col("Delimiter")))

df.show(false)
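If extra clean-up is wanted (as noted above for split_col), the Scala UDF body can be extended in the same way. A small illustrative sketch; the trimming and empty-element filtering here are my own assumption of what such extra changes might look like, not part of the original answer.

// Illustrative variant only: trim each piece and drop empty elements.
// It shares the regex caveat of the UDF above (addressed in Edit 2 below).
val udf_split_col_clean = udf { (x: String, y: String) =>
  x.split(y).map(_.trim).filter(_.nonEmpty)
}

df = df.withColumn("split_values", udf_split_col_clean(df.col("VALUES"), df.col("Delimiter")))
df.show(false)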

Edit 2

To avoid the issue with special characters being interpreted as regexes, you can pass a Char instead of a String to the split() method, as follows (this works here because each delimiter is a single character).

val udf_split_col = udf { (x: String, y: String) => x.split(y.charAt(0)) }

Upvotes: 3

linusRian

Reputation: 340

This is another way of handling this, using Spark SQL.

df.createOrReplaceTempView("test")

spark.sql("""select VALUES,delimiter,split(values,case when delimiter in ("$","^") then concat("\\",delimiter) else delimiter end) as split_value from test""").show(false)

Note that I included the case when statement to add escape characters for the '$' and '^' cases; otherwise the split doesn't work.

+-------------------+---------+----------------------+
|VALUES             |delimiter|split_value           |
+-------------------+---------+----------------------+
|50000.0#0#0#       |#        |[50000.0, 0, 0, ]     |
|0@1000.0@          |@        |[0, 1000.0, ]         |
|1$                 |$        |[1, ]                 |
|1000.00^Test_string|^        |[1000.00, Test_string]|
+-------------------+---------+----------------------+

Upvotes: 2

dsk

Reputation: 2003

You need to use expr with split() to make the split dynamic:

from pyspark.sql import functions as F

df = spark.createDataFrame([("50000.0#0#0#", "#"), ("0@1000.0@", "@")], ["VALUES", "Delimiter"])
df = df.withColumn("split", F.expr("""split(VALUES, Delimiter)"""))
df.show()

+------------+---------+-----------------+
|      VALUES|Delimiter|            split|
+------------+---------+-----------------+
|50000.0#0#0#|        #|[50000.0, 0, 0, ]|
|   0@1000.0@|        @|    [0, 1000.0, ]|
+------------+---------+-----------------+

Upvotes: 4
