Reputation: 704
I'm using Python with Spark to process some data that contains accented Portuguese words.
Some examples of the data look like this:
.. -- Água, 1234 ...
- -- https://www.example.com/page.html *****
I'm trying to remove anything that is not a word or number from the left or right of the string, getting clean results like this:
Água, 1234
https://www.example.com/page.html
The best I could do is this:
^[^\\p{N}\\p{L}]]|[^\\p{N}\\p{L}]$
But this didn't work. I saw a lot of solutions, but none of them match the beginning and end of a string with accented characters.
Thanks in advance.
Upvotes: 1
Views: 315
Reputation: 704
I was able to do it.
Thanks to αԋɱҽԃ αмєяιcαη. It's not the best solution because it goes outside of PySpark's regexp_replace function, but it works: I just added the re.UNICODE flag and created a UDF.
import re
from pyspark.sql.functions import udf

regexp = re.compile(r'^\W+|\W+$', flags=re.UNICODE)

def remove_non_utf8(string):
    return regexp.sub('', string)
replace_utf8 = udf(remove_non_utf8)
This removes all non-word characters from the beginning or end; I used this url as a reference.
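Here is a minimal, self-contained sketch of applying the UDF to a DataFrame; the sample rows and the "text" column name are assumptions taken from the question:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, udf
import re

spark = SparkSession.builder.getOrCreate()

# sample rows and the "text" column name are assumptions based on the question
df = spark.createDataFrame(
    [(".. -- Água, 1234 ...",),
     ("- -- https://www.example.com/page.html *****",)],
    ["text"],
)

regexp = re.compile(r'^\W+|\W+$', flags=re.UNICODE)
replace_utf8 = udf(lambda s: regexp.sub('', s) if s is not None else None)

df.withColumn("clean", replace_utf8(col("text"))).show(truncate=False)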
--EDIT--
I tried using:
(?ui)^\W+|\W+$
with PySpark's regexp_replace function, but it didn't work, so I'm sticking with the re solution.
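For reference, this is roughly what that attempt looked like (a sketch only; df and the "text" column name are placeholders):

from pyspark.sql.functions import regexp_replace

# the regexp_replace attempt described above; "text" is a placeholder column name
df = df.withColumn("clean", regexp_replace("text", r"(?ui)^\W+|\W+$", ""))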
Upvotes: 0
Reputation: 27723
Maybe it would help if we looked into the data you have and then wrote an expression similar to:
(?i)\S[a-z].+[a-z0-9]
or,
(?i)\S*[a-z].+[a-z0-9]
If you wish to simplify/modify/explore the expression, it's explained in the top right panel of regex101.com. If you'd like, you can also watch in this link how it would match against some sample inputs.
import re
regex = r"(?i)\S[a-z].+[a-z0-9]"
string = """
.. -- Água, 1234 ...
- -- https://www.example.com/page.html *****
"""
print(re.findall(regex, string))
Output:
['Água, 1234', 'https://www.example.com/page.html']
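If you need this back in PySpark, the same pattern might work directly with regexp_extract, since (?i), \S, and the character classes used here are also valid in Java's regex engine; a sketch, assuming a DataFrame df with a "text" column:

from pyspark.sql.functions import regexp_extract

# group 0 returns the whole match; "text" is an assumed column name
df = df.withColumn("clean", regexp_extract("text", r"(?i)\S[a-z].+[a-z0-9]", 0))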
Upvotes: 1