Reputation: 343
Good afternoon everyone. I am having trouble cleaning special characters out of a string column of a DataFrame. I just want to remove special characters such as HTML components, emojis, and Unicode errors, for example \u2013.
Does anyone have a regular expression that could help me, or any suggestions on how to handle this problem?
input:
i want to remove 😃 and codes "\u2022"
expected output:
i want to remove and codes
I tried:
re.sub('[^A-Za-z0-9 \u2022]+', '', nome)
regexp_replace('nome', '\r\n|/[\x00-\x1F\x7F]/u', ' ')
df = df.withColumn( "value_2", F.regexp_replace(F.regexp_replace("value", "[^\x00-\x7F]+", ""), '""', '') )
df = df.withColumn("new",df.text.encode('ascii', errors='ignore').decode('ascii'))
I tried these solutions, but none of them recognizes the character "\u2013". Has anyone experienced this?
Upvotes: 3
Views: 2476
Reputation: 32700
You can use this regex with the regexp_replace function to remove all non-ASCII characters from the column, then remove any empty pair of double quotes that remains:
import pyspark.sql.functions as F
df = spark.createDataFrame([('i want to remove 😃 and codes "\u2022"',)], ["value"])
df = df.withColumn(
    "value_2",
    F.regexp_replace(F.regexp_replace("value", "[^\x00-\x7F]+", ""), '""', '')
)
df.show(truncate=False)
#+---------------------------------+----------------------------+
#|value                            |value_2                     |
#+---------------------------------+----------------------------+
#|i want to remove 😃 and codes "•"|i want to remove  and codes |
#+---------------------------------+----------------------------+
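The character class only deletes the non-ASCII characters themselves, so the spaces around them stay and the result can still contain doubled or trailing spaces. As a rough sketch, the same two replacements can be tried in plain Python with one extra whitespace cleanup step. The helper name strip_non_ascii is made up for this illustration, and the whitespace collapsing is an addition that is not part of the answer above:

```python
import re

def strip_non_ascii(text):
    """Hypothetical helper: mirrors the two regexp_replace calls, then tidies spaces."""
    # Step 1: drop every run of characters outside the ASCII range.
    no_unicode = re.sub(r"[^\x00-\x7F]+", "", text)
    # Step 2: drop the empty pair of double quotes left behind.
    no_quotes = no_unicode.replace('""', "")
    # Step 3 (added assumption): collapse repeated whitespace and trim the ends.
    return re.sub(r"\s+", " ", no_quotes).strip()

sample = 'i want to remove 😃 and codes "\u2022"'
print(strip_non_ascii(sample))  # i want to remove and codes
```

In Spark itself, a similar cleanup could presumably be chained as one more regexp_replace on "\\s+" followed by F.trim, but that variant is untested here.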
Upvotes: 2
Reputation: 856
import re
L=re.findall(r"[^😃•]+", "abdasfrasdfadfs😃adfaa😃•sdf•adsfasfasfasf")
print(L) # prints ['abdasfrasdfadfs', 'adfaa', 'sdf', 'adsfasfasfasf']
So, to remove the smiley emoji and the bullet character (\u2022), apply the pattern above with findall and then join the returned list, like below:
import re
given_string = "😃Your input• string 😃"
result_string = "".join(re.findall(r"[^😃•]+", given_string))
print(result_string) #prints 'Your input string '
If you know the Unicode code point of a character, you can write it as an escape in the pattern instead of the literal character, like below:
result_string = "".join(re.findall(r"[^😃\u2022]+", given_string))
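Listing each character by hand does not scale to arbitrary emojis, so here is a minimal sketch that swaps findall plus join for re.sub and removes every non-ASCII character. It also covers one guess about the question: that the column may hold the escape as literal text (a backslash, the letter u, and four hex digits) rather than the actual character, which would explain why "\u2013" is never matched. The names LITERAL_ESCAPE, NON_ASCII and clean are hypothetical:

```python
import re

# Assumption: the data may contain escapes as literal text, e.g. the six characters \u2013.
LITERAL_ESCAPE = re.compile(r"\\u[0-9a-fA-F]{4}")
# Any run of characters outside the ASCII range (emojis, bullets, dashes, ...).
NON_ASCII = re.compile(r"[^\x00-\x7F]+")

def clean(text):
    """Hypothetical helper: remove literal \\uXXXX sequences, then non-ASCII characters."""
    text = LITERAL_ESCAPE.sub("", text)
    return NON_ASCII.sub("", text)

given_string = "😃Your input• string 😃"
print(repr(clean(given_string)))         # 'Your input string '
print(repr(clean(r"dash \u2013 here")))  # 'dash  here'
```

Note that this also strips legitimate accented letters, so it only fits data that is expected to be plain ASCII.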
Upvotes: 0