merkle

Reputation: 1815

Remove the repeated punctuation from pyspark dataframe

I need to remove repeated punctuation characters and keep only the last occurrence from each run.

For example: !!!! -> !
             !!$$ -> !$

I have a dataset that looks like this:

temp = spark.createDataFrame([
    (0, "This is Spark!!!!"),
    (1, "I wish Java could use case classes!!##"),
    (2, "Data science is  cool#$@!"),
    (3, "Machine!!$$")
], ["id", "words"])

+---+--------------------------------------+
|id |words                                 |
+---+--------------------------------------+
|0  |This is Spark!!!!                     |
|1  |I wish Java could use case classes!!##|
|2  |Data science is  cool#$@!             |
|3  |Machine!!$$                           |
+---+--------------------------------------+

I tried a regex that removes specific punctuation characters:

from pyspark.sql import functions as F

df2 = temp.select(
    [F.regexp_replace(col, r',|\.|&|\\|\||-|_', '').alias(col) for col in temp.columns]
)

But this does not do what I need. Can anyone tell me how to achieve this in PySpark?

Below is the desired output.

    id  words
0   0   This is Spark!
1   1   I wish Java could use case classes!#
2   2   Data science is cool#$@!
3   3   Machine!$

Upvotes: 1

Views: 138

Answers (1)

Emma

Reputation: 9308

You can use this regex.

df2 = temp.select('id',
    F.regexp_replace('words', r'([!$#])\1+', '$1').alias('words'))

Regex explanation:

(   -> Start a capturing group; everything up to the matching ) is captured
[   -> Start a character class; match any single character listed up to ]

([!$#]) -> A capturing group that matches any one of !, $, #

\1  -> Backreference to whatever the first capturing group matched
+   -> Match one or more of the preceding group or character

([!$#])\1+ -> Match any of !, $, # followed by one or more repeats of that same character, i.e. a run of two or more.

The last argument of regexp_replace is $1, which refers to the first capturing group (a single !, $, or #), so each run of repeated characters is replaced with just that single character.
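
If you want to check the pattern without starting a Spark session, here is a minimal sketch using Python's built-in re module. It assumes the pattern behaves the same way in Python's re as in the Java regex engine that Spark uses, which should hold for a simple pattern like this one. Note that Python writes the replacement backreference as \1 where Spark uses $1. The helper name collapse_repeats is made up for this example.

```python
import re

# Same pattern as above. In Python's re the replacement uses \1,
# whereas Spark's regexp_replace uses $1.
REPEATED = re.compile(r'([!$#])\1+')

def collapse_repeats(text):
    """Collapse each run of an identical !, $ or # into a single character."""
    return REPEATED.sub(r'\1', text)

# Sample rows taken from the question.
samples = [
    "This is Spark!!!!",
    "I wish Java could use case classes!!##",
    "Data science is  cool#$@!",
    "Machine!!$$",
]

for s in samples:
    print(collapse_repeats(s))
```

The third row is left unchanged because none of its punctuation characters is repeated consecutively.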

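You can add more characters between [ and ] to match more special characters. Characters that have a special meaning inside a character class, such as ], \, ^ and -, may need to be escaped with a backslash.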
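
If listing every character becomes tedious, one possible variation (not part of the original answer) is a negated class that covers any character that is neither a word character nor whitespace. The sketch below tries that idea with Python's re module; the names ANY_PUNCT_RUN and collapse_any_punct are hypothetical. The same pattern string should be usable in regexp_replace with $1 as the replacement, but treat that as an assumption and test it on your own data, since \w can differ between engines for non-ASCII text.

```python
import re

# Hypothetical broader pattern: any single character that is not a word
# character (\w) and not whitespace (\s), followed by repeats of itself.
ANY_PUNCT_RUN = re.compile(r'([^\w\s])\1+')

def collapse_any_punct(text):
    """Collapse runs of any repeated punctuation character to one character."""
    return ANY_PUNCT_RUN.sub(r'\1', text)

print(collapse_any_punct("Machine!!$$"))
print(collapse_any_punct("Wait....what??"))
```

Because \w includes the underscore, runs of _ are not collapsed by this pattern; add it to the class explicitly if you need that.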

Upvotes: 1
