Reputation: 451
I would like to substring each element of an array column in PySpark 2.2. My df looks like the one below, which is similar to this, although each element in my df has the same length before the hyphen delimiter.
+---------------------------------+----------------------+
|col1 |new_column |
+---------------------------------+----------------------+
|[hello-123, abcde-111] |[hello, abcde] |
|[hello-234, abcde-221, xyzhi-333]|[hello, abcde, xyzhi] |
|[hiiii-111, abbbb-333, xyzhu-222]|[hiiii, abbbb, xyzhu] |
+---------------------------------+----------------------+
I tried adjusting the udf from the prior question, based on this answer, to obtain the output in new_column above, but no luck so far. Is there a way to make this work in PySpark 2.2?
import pyspark.sql.functions as F
import pyspark.sql.types as T

# Keep the first 5 characters of each element in the array.
cust_udf = F.udf(lambda arr: [x[0:5] for x in arr], T.ArrayType(T.StringType()))
df1.withColumn('new_column', cust_udf(F.col('col1')))
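For reproduction, here is a self-contained version of the attempt; the SparkSession setup and sample rows are assumptions added so the snippet runs standalone, not part of the original post:

import pyspark.sql.functions as F
import pyspark.sql.types as T
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample rows shaped like the df above (assumed for illustration).
df1 = spark.createDataFrame(
    [(['hello-123', 'abcde-111'],),
     (['hello-234', 'abcde-221', 'xyzhi-333'],)],
    ['col1'])

cust_udf = F.udf(lambda arr: [x[0:5] for x in arr], T.ArrayType(T.StringType()))
df1.withColumn('new_column', cust_udf(F.col('col1'))).show(truncate=False)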
Upvotes: 1
Views: 1697
Reputation: 451
Solved this using a different approach: explode the array, substring the elements, and then collect them back into an array.
import pyspark.sql.functions as F

# Tag each row with an id so the exploded elements can be grouped
# back together afterwards.
df1\
    .withColumn('idx', F.monotonically_increasing_id())\
    .withColumn('exploded_col', F.explode(F.col('col1')))\
    .withColumn('substr_col', F.substring(F.col('exploded_col'), 1, 5))\
    .groupBy(F.col('idx'))\
    .agg(F.collect_set('substr_col').alias('new_column'))
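One caveat with collect_set: it drops duplicate substrings and gives no ordering guarantee within the rebuilt array. If order and duplicates matter, a hedged variant (not part of the original answer) can carry the element position from posexplode, available since Spark 2.1, and sort on it before collecting:

import pyspark.sql.functions as F

# posexplode keeps each element's position, so the array can be rebuilt
# in its original order; collect_list also keeps duplicates.
df1\
    .withColumn('idx', F.monotonically_increasing_id())\
    .select('idx', F.posexplode('col1').alias('pos', 'elem'))\
    .withColumn('substr_col', F.substring('elem', 1, 5))\
    .groupBy('idx')\
    .agg(F.sort_array(F.collect_list(F.struct('pos', 'substr_col')))
          .getField('substr_col')
          .alias('new_column'))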
Upvotes: 1
Reputation: 214927
Your udf approach works for me. Besides, you can use transform with substring:
import pyspark.sql.functions as f
df.withColumn('new_column', f.expr('transform(col1, x -> substring(x, 0, 5))')).show()
+--------------------+--------------------+
| col1| new_column|
+--------------------+--------------------+
|[hello-123, abcde...| [hello, abcde]|
|[hello-234, abcde...|[hello, abcde, xy...|
|[hiiii-111, abbbb...|[hiiii, abbbb, xy...|
+--------------------+--------------------+
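Two notes: Spark's SQL substring is 1-based (a start position of 0 is treated the same as 1, which is why the expression above still works), and transform as a SQL higher-order function needs Spark 2.4+, so it is not available on 2.2 itself. From Spark 3.1 onward the same thing can be written directly in the Python API; a sketch:

import pyspark.sql.functions as f

# Spark 3.1+ only: transform is exposed in the Python API and takes a
# Column-to-Column lambda instead of a SQL expression string.
df.withColumn('new_column', f.transform('col1', lambda x: f.substring(x, 1, 5)))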
Upvotes: 1