Reputation: 59
I have another question related to the split function. I am new to Spark/Scala.
Below is the sample data frame -
+-------------------+---------+
|             VALUES|Delimiter|
+-------------------+---------+
|       50000.0#0#0#|        #|
|          0@1000.0@|        @|
|                 1$|        $|
|1000.00^Test_string|        ^|
+-------------------+---------+
and I want the output to be -
+-------------------+---------+----------------------+
|VALUES             |Delimiter|split_values          |
+-------------------+---------+----------------------+
|50000.0#0#0#       |#        |[50000.0, 0, 0, ]     |
|0@1000.0@          |@        |[0, 1000.0, ]         |
|1$                 |$        |[1, ]                 |
|1000.00^Test_string|^        |[1000.00, Test_string]|
+-------------------+---------+----------------------+
I tried to split this manually -
dept.select(split(col("VALUES"), "#|@|\\$|\\^")).show()
and the output is -
+-----------------------+
|split(VALUES,#|@|\$|\^)|
+-----------------------+
|      [50000.0, 0, 0, ]|
|          [0, 1000.0, ]|
|                  [1, ]|
|   [1000.00, Test_st...|
+-----------------------+
But I want the delimiter to be picked up from the Delimiter column automatically, since this needs to run on a large dataset.
Upvotes: 5
Views: 2309
Reputation: 46
First Case:
To handle the special characters ('$', '^') we need to prefix the Delimiter column with the escape character backslash '\\', like this below:
import org.apache.spark.sql.functions.expr
val new_deptArray = dept.withColumn("split_values",expr("""split(VALUES, Concat('\\',Delimiter))"""))
new_deptArray.show(truncate = false)
And this will be the output:
+-------------------+---------+----------------------+
|VALUES             |Delimiter|split_values          |
+-------------------+---------+----------------------+
|50000.0#0#0#       |#        |[50000.0, 0, 0, ]     |
|0@1000.0@          |@        |[0, 1000.0, ]         |
|1$                 |$        |[1, ]                 |
|1000.00^Test_string|^        |[1000.00, Test_string]|
+-------------------+---------+----------------------+
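If it helps to see what split() actually receives per row, here is a small optional check (using a hypothetical helper column named escaped_delim, not part of the original answer) that materializes the concatenated pattern:
import org.apache.spark.sql.functions.expr
// Each delimiter becomes an escaped, single-character regex: \#, \@, \$, \^
dept.withColumn("escaped_delim", expr("""Concat('\\', Delimiter)""")).show()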
Second Case:
Notice that a few rows got an empty string at the end (because the value ends with its delimiter), and ideally we don't want them. Since this new column split_values is of type Array<String>, we can just use array_remove to drop these empty strings.
import org.apache.spark.sql.functions.{expr, array_remove}
val new_deptArray = dept.withColumn("split_values",array_remove(expr("""split(VALUES, Concat('\\',Delimiter))"""), ""))
new_deptArray.show(truncate = false)
It will give this output:
+-------------------+---------+----------------------+
|VALUES             |Delimiter|split_values          |
+-------------------+---------+----------------------+
|50000.0#0#0#       |#        |[50000.0, 0, 0]       |
|0@1000.0@          |@        |[0, 1000.0]           |
|1$                 |$        |[1]                   |
|1000.00^Test_string|^        |[1000.00, Test_string]|
+-------------------+---------+----------------------+
It works almost the same in PySpark; the only difference is that we need to add four backslashes '\\\\', unlike Scala where two backslashes were sufficient.
from pyspark.sql.functions import expr, array_remove
new_deptArray = dept.withColumn("split_values",array_remove(expr("""split(VALUES, concat("\\\\",Delimiter))"""), ""))
new_deptArray.show(truncate = False)
+-------------------+---------+----------------------+
|VALUES             |Delimiter|split_values          |
+-------------------+---------+----------------------+
|50000.0#0#0#       |#        |[50000.0, 0, 0]       |
|0@1000.0@          |@        |[0, 1000.0]           |
|1$                 |$        |[1]                   |
|1000.00^Test_string|^        |[1000.00, Test_string]|
+-------------------+---------+----------------------+
Thanks.
Upvotes: 1
Reputation: 41
This is my latest solution:
import java.util.regex.Pattern
import org.apache.spark.sql.functions.{col, udf}
// Pattern.quote treats the delimiter as a literal, so regex metacharacters need no manual escaping
val split_udf = udf((value: String, delimiter: String) => value.split(Pattern.quote(delimiter), -1))
val solution = dept.withColumn("split_values", split_udf(col("VALUES"), col("Delimiter")))
solution.show(truncate = false)
It handles special characters in the Delimiter column automatically. The other answers do not work for
("50000.0\\0\\0\\", "\\")
and linusRian's answer needs the special characters to be added manually.
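For example, here is a minimal sketch of that case (hypothetical sample data, assuming a SparkSession named spark): a lone backslash is an invalid regex for a plain split(), but Pattern.quote turns it into a literal separator.
import java.util.regex.Pattern
import org.apache.spark.sql.functions.{col, udf}
import spark.implicits._
// One row whose delimiter is a single backslash
val tricky = Seq(("50000.0\\0\\0\\", "\\")).toDF("VALUES", "Delimiter")
// Pattern.quote wraps the delimiter in \Q...\E so it is matched literally
val quoted_split = udf((value: String, delimiter: String) => value.split(Pattern.quote(delimiter), -1))
tricky.withColumn("split_values", quoted_split(col("VALUES"), col("Delimiter"))).show(false)
// split_values: [50000.0, 0, 0, ]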
Upvotes: 1
Reputation: 978
EDIT: Please check the bottom of the answer for the Scala version.
You can use a custom user-defined function (pyspark.sql.functions.udf) to achieve this.
from typing import List
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType, ArrayType
# str.split treats the delimiter as a literal string, so '$' and '^' need no escaping
def split_col(value: str, delimiter: str) -> List[str]:
    return str(value).split(str(delimiter))
udf_split = udf(lambda x, y: split_col(x, y), ArrayType(StringType()))
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([
    ('50000.0#0#0#', '#'), ('0@1000.0@', '@'), ('1$', '$'), ('1000.00^Test_string', '^')
], schema='VALUES String, Delimiter String')
df = df.withColumn("split_values", udf_split(df['VALUES'], df['Delimiter']))
df.show(truncate=False)
Output
+-------------------+---------+----------------------+
|VALUES             |Delimiter|split_values          |
+-------------------+---------+----------------------+
|50000.0#0#0#       |#        |[50000.0, 0, 0, ]     |
|0@1000.0@          |@        |[0, 1000.0, ]         |
|1$                 |$        |[1, ]                 |
|1000.00^Test_string|^        |[1000.00, Test_string]|
+-------------------+---------+----------------------+
Note that the split_values column contains a list of strings. You can also update the split_col function to apply further transformations to the values.
EDIT: Scala version
import org.apache.spark.sql.functions.udf
import spark.implicits._
val data = Seq(("50000.0#0#0#", "#"), ("0@1000.0@", "@"), ("1$", "$"), ("1000.00^Test_string", "^"))
var df = data.toDF("VALUES", "Delimiter")
val udf_split_col = udf { (x: String, y: String) => x.split(y) }
df = df.withColumn("split_values", udf_split_col(df.col("VALUES"), df.col("Delimiter")))
df.show(false)
Edit 2
To avoid the issue with special characters being interpreted as regexes, you can use a Char instead of a String when calling the split() method, as follows.
val udf_split_col = udf { (x: String, y: String) => x.split(y.charAt(0)) }
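As a quick illustration with the sample values above (a REPL-style sketch, not from the original answer), the Char overload treats '$' and '^' literally; note that, unlike split(pattern, -1), it also drops trailing empty strings:
// Char-based split needs no escaping for regex metacharacters
"1$".split('$')                    // Array(1)
"1000.00^Test_string".split('^')   // Array(1000.00, Test_string)
"50000.0#0#0#".split('#')          // Array(50000.0, 0, 0) -- trailing empty strings are dropped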
Upvotes: 3
Reputation: 340
This is another way of handling this, using Spark SQL:
df.createOrReplaceTempView("test")
spark.sql("""select VALUES,delimiter,split(values,case when delimiter in ("$","^") then concat("\\",delimiter) else delimiter end) as split_value from test""").show(false)
Note that I included the case when statement to add escape characters to handle the '$' and '^' cases; otherwise the split doesn't work.
+-------------------+---------+----------------------+
|VALUES             |delimiter|split_value           |
+-------------------+---------+----------------------+
|50000.0#0#0#       |#        |[50000.0, 0, 0, ]     |
|0@1000.0@          |@        |[0, 1000.0, ]         |
|1$                 |$        |[1, ]                 |
|1000.00^Test_string|^        |[1000.00, Test_string]|
+-------------------+---------+----------------------+
Upvotes: 2
Reputation: 2003
You need to use expr with split() to make the split dynamic:
import pyspark.sql.functions as F
df = spark.createDataFrame([("50000.0#0#0#", "#"), ("0@1000.0@", "@")], ["VALUES", "Delimiter"])
df = df.withColumn("split", F.expr("""split(VALUES, Delimiter)"""))
df.show()
+------------+---------+-----------------+
|      VALUES|Delimiter|            split|
+------------+---------+-----------------+
|50000.0#0#0#|        #|[50000.0, 0, 0, ]|
|   0@1000.0@|        @|    [0, 1000.0, ]|
+------------+---------+-----------------+
Upvotes: 4