Reputation: 139
I am working on a PySpark DataFrame and I have a column of words (array<string> type). What regex pattern should I use to drop purely numeric values and strip the digits from the remaining words?
+---+----------------------------------------------+
|id |words                                         |
+---+----------------------------------------------+
|564|[fhbgtrj5, 345gjhg, ghth578ghu, 5897, fhrfu44]|
+---+----------------------------------------------+
expected output:
+---+-------------------------------+
|id |words                          |
+---+-------------------------------+
|564|[fhbgtrj, gjhg, ghthghu, fhrfu]|
+---+-------------------------------+
Please help.
Upvotes: 0
Views: 2746
Reputation: 42352
You can use transform together with regexp_replace to strip the digits from each word, and array_remove to drop the empty strings left behind by entries that consisted only of digits. Both transform and array_remove require Spark 2.4 or later.
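from pyspark.sql import functions as F

# Strip digits from every word, then drop the words that became empty strings.
df2 = df.withColumn(
    'words',
    F.expr("array_remove(transform(words, x -> regexp_replace(x, '[0-9]', '')), '')")
)
df2.show(truncate=False)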
+---+-------------------------------+
|id |words                          |
+---+-------------------------------+
|564|[fhbgtrj, gjhg, ghthghu, fhrfu]|
+---+-------------------------------+
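If you want to sanity-check the pattern without starting a Spark session, the same two steps (strip digits, then drop empty strings) can be sketched in plain Python. This only mirrors the logic of the Spark expression; strip_digits is a hypothetical helper name used for illustration and is not part of PySpark.

```python
import re


def strip_digits(words):
    """Remove ASCII digits from each word, then drop words that became empty.

    Hypothetical helper mirroring the Spark expression:
    array_remove(transform(words, x -> regexp_replace(x, '[0-9]', '')), '')
    """
    # Step 1: same role as transform + regexp_replace
    cleaned = [re.sub(r"[0-9]", "", w) for w in words]
    # Step 2: same role as array_remove(..., '')
    return [w for w in cleaned if w != ""]


sample = ["fhbgtrj5", "345gjhg", "ghth578ghu", "5897", "fhrfu44"]
print(strip_digits(sample))  # ['fhbgtrj', 'gjhg', 'ghthghu', 'fhrfu']
```

The entry 5897 becomes an empty string after the replacement, which is why the second step is needed.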
Upvotes: 1