Reputation: 451
I would like to substring each element of an array column in PySpark 2.2. My df looks like the one below, which is similar to this, although each element in my df has the same length before the hyphen delimiter.
+---------------------------------+----------------------+
|col1 |new_column |
+---------------------------------+----------------------+
|[hello-123, abcde-111] |[hello, abcde] |
|[hello-234, abcde-221, xyzhi-333]|[hello, abcde, xyzhi] |
|[hiiii-111, abbbb-333, xyzhu-222]|[hiiii, abbbb, xyzhu] |
+---------------------------------+----------------------+
I tried adjusting the udf from the prior question, based on this answer, to obtain the output in new_column above, but no luck so far. Is there a way to make this work in PySpark 2.2?
import pyspark.sql.functions as F
import pyspark.sql.types as T

# Keep the first 5 characters of each element in the array.
cust_udf = F.udf(lambda arr: [x[0:5] for x in arr], T.ArrayType(T.StringType()))
df1.withColumn('new_column', cust_udf(F.col('col1')))
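For reproduction, here is a self-contained version of the attempt; the SparkSession setup and sample rows are assumptions added so the snippet runs standalone, not part of the original post:

import pyspark.sql.functions as F
import pyspark.sql.types as T
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample rows shaped like the df above (assumed for illustration).
df1 = spark.createDataFrame(
    [(['hello-123', 'abcde-111'],),
     (['hello-234', 'abcde-221', 'xyzhi-333'],)],
    ['col1'])

cust_udf = F.udf(lambda arr: [x[0:5] for x in arr], T.ArrayType(T.StringType()))
df1.withColumn('new_column', cust_udf(F.col('col1'))).show(truncate=False)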
Upvotes: 1
Views: 1697
Reputation: 451
Solved this using a different approach: explode the array, substring the elements, and then collect them back into an array.
import pyspark.sql.functions as F

# Tag each row with an id so the exploded elements can be grouped
# back together afterwards.
df1\
    .withColumn('idx', F.monotonically_increasing_id())\
    .withColumn('exploded_col', F.explode(F.col('col1')))\
    .withColumn('substr_col', F.substring(F.col('exploded_col'), 1, 5))\
    .groupBy(F.col('idx'))\
    .agg(F.collect_set('substr_col').alias('new_column'))
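One caveat with collect_set: it drops duplicate substrings and gives no ordering guarantee within the rebuilt array. If order and duplicates matter, a hedged variant (not part of the original answer) can carry the element position from posexplode, available since Spark 2.1, and sort on it before collecting:

import pyspark.sql.functions as F

# posexplode keeps each element's position, so the array can be rebuilt
# in its original order; collect_list also keeps duplicates.
df1\
    .withColumn('idx', F.monotonically_increasing_id())\
    .select('idx', F.posexplode('col1').alias('pos', 'elem'))\
    .withColumn('substr_col', F.substring('elem', 1, 5))\
    .groupBy('idx')\
    .agg(F.sort_array(F.collect_list(F.struct('pos', 'substr_col')))
          .getField('substr_col')
          .alias('new_column'))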
Upvotes: 1
Reputation: 214927
Your udf approach works for me. Besides, you can use transform with substring:
import pyspark.sql.functions as f
df.withColumn('new_column', f.expr('transform(col1, x -> substring(x, 0, 5))')).show()
+--------------------+--------------------+
| col1| new_column|
+--------------------+--------------------+
|[hello-123, abcde...| [hello, abcde]|
|[hello-234, abcde...|[hello, abcde, xy...|
|[hiiii-111, abbbb...|[hiiii, abbbb, xy...|
+--------------------+--------------------+
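Two notes: Spark's SQL substring is 1-based (a start position of 0 is treated the same as 1, which is why the expression above still works), and transform as a SQL higher-order function needs Spark 2.4+, so it is not available on 2.2 itself. From Spark 3.1 onward the same thing can be written directly in the Python API; a sketch:

import pyspark.sql.functions as f

# Spark 3.1+ only: transform is exposed in the Python API and takes a
# Column-to-Column lambda instead of a SQL expression string.
df.withColumn('new_column', f.transform('col1', lambda x: f.substring(x, 1, 5)))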
Upvotes: 1