Reputation:
I have a very large dataset. How can I remove all punctuation from it in PySpark? For example: , . & \ | - _
Upvotes: 1
Views: 1518
Reputation: 42352
You can use regexp_replace with a regular expression that matches the punctuation characters you listed and replaces each match with an empty string:
import pyspark.sql.functions as F

# Apply the replacement to every column, keeping the original column names.
df2 = df.select(
    [F.regexp_replace(col, r',|\.|&|\\|\||-|_', '').alias(col) for col in df.columns]
)
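Before running this on a large dataset, it can help to check the pattern on a small string. The sketch below uses Python's built-in re module rather than Spark, on the assumption that Spark's Java regex engine treats this simple pattern the same way. It writes the same seven characters as a character class instead of an alternation; the names PUNCT and strip_punct are made up for this illustration.

```python
import re

# The same seven characters as the alternation above, as a character class.
# Inside [...], only the backslash needs escaping; the hyphen is placed last
# so it is read literally rather than as a range.
PUNCT = re.compile(r"[,.&\\|_-]")


def strip_punct(text: str) -> str:
    """Remove every character matched by PUNCT from text."""
    return PUNCT.sub("", text)


sample = r"a,b.c&d\e|f-g_h"
print(strip_punct(sample))  # prints: abcdefgh
```

If the result looks right, the same character class string should be usable as the pattern argument of regexp_replace in place of the alternation.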
Upvotes: 1