T.SURESH ARUNACHALAM

Reputation: 285

How to remove some characters from a list based on a pattern in a PySpark DataFrame

This is the field that contains the list:

+--------------------+
|      categoryPathId|
+--------------------+
|[summer|Summer, w...|
|     [ab|ba, caa|da]|
|                  []|
|[shop-all|Shop Al...|
+--------------------+

Each value in the list contains two parts separated by a pipe symbol (|).

It will look like [ab|ba, caa|da]. I want to remove the second word (i.e. everything after the pipe symbol) from every value in the list. The expected result looks like [ab, caa].

Can you help me solve this?
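
For reference, a DataFrame with the same shape as the one above can be built like this (a minimal sketch; the sample rows are illustrative, not my real data):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Array column in which every element holds two parts joined by "|"
df = spark.createDataFrame(
    [(["summer|Summer"],), (["ab|ba", "caa|da"],), ([],)],
    schema="categoryPathId array<string>",
)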

Upvotes: 1

Views: 158

Answers (1)

Shubham Jain

Reputation: 5536

Spark 2.4+

You can use a higher-order function to perform this operation:

from pyspark.sql.functions import expr

# transform() applies the lambda to each array element: split on a literal "|" (escaped, since split expects a regex) and keep the first part
df = df.select(expr("transform(categoryPathId, x -> split(x, '\\\\|')[0])").alias('categoryPathId1'))
df.show()
+---------------+
|categoryPathId1|
+---------------+
|         [a, c]|
|         [a, c]|
|         [a, c]|
+---------------+
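
If you are on a Spark version older than 2.4, where the transform higher-order function is not available, a plain Python UDF is one possible fallback (a sketch, not part of the original answer):

from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType

# Keep only the part before the first "|" in every element of the array
strip_after_pipe = udf(
    lambda xs: [x.split('|')[0] for x in xs] if xs is not None else None,
    ArrayType(StringType()),
)

df = df.withColumn('categoryPathId1', strip_after_pipe('categoryPathId'))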

Upvotes: 2
