jalazbe
jalazbe

Reputation: 2005

pyspark trim and split

I am trying to create an ArrayType from an StringType but I am unable to do a trim and split at the same time.

Create DataFrame

data = [
    (
        "1",
        "Example 1",
        "element1 | element two | element (three) ",        
    ),     
]
schema = StructType(
    [

        StructField("id", StringType(), True),
        StructField("name", StringType(), True),
        StructField("value", StringType(), True),
    ]
)
df = spark.createDataFrame(data, schema)

What I try to do

df = (df.withColumn("value_array", 
                    f.trim(f.split(f.col("value"), "\|")
                                                 )
                                         )
                )

The error that I get

AnalysisException: cannot resolve 'trim(split(`value`, '\\|', -1))' due to data type mismatch: argument 1 requires string type, however, 'split(`value`, '\\|', -1)' is of array<string> type.;

Code that works but is not complete

df = (df.withColumn("value_arry", 
                    f.split(f.col("value"), "\|")
                    )
                )

The result I get

"[e1| e 2 | e3 ]" #These values contains spaces in 
 
+---+---------+---------------+-------------------+
| id|     name|          value|         value_arry|
+---+---------+---------------+-------------------+
|  1|Example 1|e1| e 2 | e3 3 |[e1,  e 2 ,  e3 3 ]|
|  2|Example 2|       e1 | e2 |        [e1 ,  e2 ]|
+---+---------+---------------+-------------------+

The result I would like to get would have each element without spaces

Spark Version 3.2.1

Upvotes: 1

Views: 1292

Answers (1)

blackbishop
blackbishop

Reputation: 32660

You can use transform function after split:

from pyspark.sql import functions as F

df = df.withColumn("value_arry", F.expr("transform(split(value, '[|]'), x -> trim(x))"))

df.show(truncate=False)
#+---+---------+-----------------------------------------+----------------------------------------+
#|id |name     |value                                    |value_arry                              |
#+---+---------+-----------------------------------------+----------------------------------------+
#|1  |Example 1|element1 | element two | element (three) |[element1, element two, element (three)]|
#+---+---------+-----------------------------------------+----------------------------------------+

Or split using regex \s+\|\s+:

df = df.withColumn("value_arry", F.split("value", r"\s+\|\s+"))

Upvotes: 1

Related Questions