Reputation: 2005
I am trying to create an ArrayType column from a StringType column, but I am unable to apply trim and split at the same time.
from pyspark.sql import functions as f
from pyspark.sql.types import StructType, StructField, StringType

data = [
    (
        "1",
        "Example 1",
        "element1 | element two | element (three) ",
    ),
]
schema = StructType(
    [
        StructField("id", StringType(), True),
        StructField("name", StringType(), True),
        StructField("value", StringType(), True),
    ]
)
df = spark.createDataFrame(data, schema)
df = df.withColumn(
    "value_array",
    f.trim(f.split(f.col("value"), r"\|")),
)
AnalysisException: cannot resolve 'trim(split(`value`, '\\|', -1))' due to data type mismatch: argument 1 requires string type, however, 'split(`value`, '\\|', -1)' is of array<string> type.;
df = df.withColumn("value_arry", f.split(f.col("value"), r"\|"))
"[e1| e 2 | e3 ]" #These values contains spaces in
+---+---------+---------------+-------------------+
| id| name| value| value_arry|
+---+---------+---------------+-------------------+
| 1|Example 1|e1| e 2 | e3 3 |[e1, e 2 , e3 3 ]|
| 2|Example 2| e1 | e2 | [e1 , e2 ]|
+---+---------+---------------+-------------------+
The result I would like to get would have each element without leading or trailing spaces.
Upvotes: 1
Views: 1292
Reputation: 32660
You can use the transform function after split:
from pyspark.sql import functions as F
df = df.withColumn("value_arry", F.expr("transform(split(value, '[|]'), x -> trim(x))"))
df.show(truncate=False)
#+---+---------+-----------------------------------------+----------------------------------------+
#|id |name |value |value_arry |
#+---+---------+-----------------------------------------+----------------------------------------+
#|1 |Example 1|element1 | element two | element (three) |[element1, element two, element (three)]|
#+---+---------+-----------------------------------------+----------------------------------------+
Or split using the regex \s+\|\s+:
df = df.withColumn("value_arry", F.split("value", r"\s+\|\s+"))
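If you are on Spark 3.1 or later, the same trim-after-split logic can also be written with the transform function exposed directly in the Python API instead of an expr string. A minimal sketch, assuming the same df and the F alias used above:

from pyspark.sql import functions as F

# F.transform (Spark 3.1+) applies a lambda to each element of the array
# produced by split, so every piece gets trimmed individually.
df = df.withColumn(
    "value_arry",
    F.transform(F.split("value", r"\|"), lambda x: F.trim(x)),
)
df.show(truncate=False)

Note that splitting on \s+\|\s+ only removes the whitespace around each delimiter, so a trailing space at the very end of the string (as in the sample value) stays attached to the last element; the trim-based versions remove it as well.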
Upvotes: 1