T.SURESH ARUNACHALAM

Reputation: 285

How to remove some characters from a list based on a pattern in a PySpark DataFrame

This is the field that contains the list:

+--------------------+
|      categoryPathId|
+--------------------+
|[summer|Summer, w...|
|     [ab|ba, caa|da]|
|                  []|
|[shop-all|Shop Al...|
+--------------------+

Each value in the list contains two parts separated by a pipe symbol (|).

It will look like [ab|ba, caa|da]. I want to remove the second word (i.e. everything after the pipe symbol) from every value in the list. The expected result looks like [ab, caa].

Can you help me solve this?
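
For reference, a DataFrame with the same shape as the one above can be built like this (a minimal sketch; the sample rows are illustrative, not my real data):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Array column in which every element holds two parts joined by "|"
df = spark.createDataFrame(
    [(["summer|Summer"],), (["ab|ba", "caa|da"],), ([],)],
    schema="categoryPathId array<string>",
)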

Upvotes: 1

Views: 158

Answers (1)

Shubham Jain

Reputation: 5536

Spark 2.4+

You can use a higher-order function to perform this operation:

from pyspark.sql.functions import expr

# transform() applies the lambda to each array element: split on a literal "|" (escaped, since split expects a regex) and keep the first part
df = df.select(expr("transform(categoryPathId, x -> split(x, '\\\\|')[0])").alias('categoryPathId1'))
df.show()
+---------------+
|categoryPathId1|
+---------------+
|         [a, c]|
|         [a, c]|
|         [a, c]|
+---------------+
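
If you are on a Spark version older than 2.4, where the transform higher-order function is not available, a plain Python UDF is one possible fallback (a sketch, not part of the original answer):

from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType

# Keep only the part before the first "|" in every element of the array
strip_after_pipe = udf(
    lambda xs: [x.split('|')[0] for x in xs] if xs is not None else None,
    ArrayType(StringType()),
)

df = df.withColumn('categoryPathId1', strip_after_pipe('categoryPathId'))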

Upvotes: 2
