Egodym

Reputation: 451

Substring each element of an array column in PySpark 2.2

I would like to take a substring of each element of an array column in PySpark 2.2. My df looks like the one below, which is similar to this, although every element in my df has the same length before the hyphen delimiter.

+---------------------------------+----------------------+
|col1                             |new_column            |
+---------------------------------+----------------------+
|[hello-123, abcde-111]           |[hello, abcde]        |
|[hello-234, abcde-221, xyzhi-333]|[hello, abcde, xyzhi] |
|[hiiii-111, abbbb-333, xyzhu-222]|[hiiii, abbbb, xyzhu] |
+---------------------------------+----------------------+

I tried adjusting the udf in the prior question based on this answer to obtain the output in new_column above, but no luck so far. Is there a way to make this work in PySpark 2.2?

import pyspark.sql.functions as F
import pyspark.sql.types as T

# Keep the first five characters of each element (the part before the hyphen).
cust_udf = F.udf(lambda arr: [x[0:5] for x in arr], T.ArrayType(T.StringType()))
df1.withColumn('new_column', cust_udf(F.col("col1")))

Upvotes: 1

Views: 1697

Answers (2)

Egodym

Reputation: 451

Solved this using a different approach: explode the array, substring the elements, and then collect back to array.

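import pyspark.sql.functions as F

# Tag each row, explode the array, trim each element, then regroup per row.
# collect_list keeps duplicates; collect_set would drop repeated prefixes.
df1\
   .withColumn('idx', F.monotonically_increasing_id())\
   .withColumn('exploded_col', F.explode(F.col('col1')))\
   .withColumn('substr_col', F.substring(F.col('exploded_col'), 1, 5))\
   .groupBy(F.col('idx'))\
   .agg(F.collect_list('substr_col').alias('new_column'))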

Upvotes: 1

akuiper

Reputation: 214927

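Your udf approach works for me. Alternatively, you can use the transform higher-order function with substring; note that transform requires Spark 2.4 or later, so it is not available in 2.2: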

import pyspark.sql.functions as f

df.withColumn('new_column', f.expr('transform(col1, x -> substring(x, 0, 5))')).show()

+--------------------+--------------------+
|                col1|          new_column|
+--------------------+--------------------+
|[hello-123, abcde...|      [hello, abcde]|
|[hello-234, abcde...|[hello, abcde, xy...|
|[hiiii-111, abbbb...|[hiiii, abbbb, xy...|
+--------------------+--------------------+
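If the prefix length ever varies, the per-element logic can split on the hyphen instead of slicing a fixed width. Below is a minimal sketch of a plain Python function that a udf could wrap; the name `prefixes` and the `sep` parameter are made up for illustration, and the null handling is an assumption about what you would want for missing arrays or elements.

```python
def prefixes(arr, sep="-"):
    """Return the part of each element before the first `sep`."""
    if arr is None:  # a null array reaches a Python udf as None
        return None
    # Keep null elements as None; otherwise take the text before the separator.
    return [None if x is None else x.split(sep, 1)[0] for x in arr]


# Plain-Python check, no Spark session needed.
print(prefixes(["hello-123", "abcde-111"]))  # ['hello', 'abcde']

# Hypothetical wiring in PySpark, assuming F and T are imported as in the question:
# cust_udf = F.udf(prefixes, T.ArrayType(T.StringType()))
# df1.withColumn('new_column', cust_udf(F.col('col1')))
```

Keeping the logic in a named function rather than a lambda makes it easy to test outside Spark before registering it as a udf.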

Upvotes: 1
