Reputation: 704
I'm using Python with Spark to process some data that contains accented Portuguese words.
Some examples of the data look like this:
.. -- Água, 1234 ...
- -- https://www.example.com/page.html *****
I'm trying to remove anything that is not a word or number from the left or right of the string, getting clean results like this:
Água, 1234
https://www.example.com/page.html
The best I could do is this:
^[^\\p{N}\\p{L}]]|[^\\p{N}\\p{L}]$
But this didn't work. I saw a lot of solutions, but none of them match the beginning and end of a string with accented characters.
Thanks in advance.
Upvotes: 1
Views: 315
Reputation: 704
I was able to do it.
Thanks to αԋɱҽԃ αмєяιcαη. It's not the best solution because it goes outside of PySpark's regexp_replace function, but it works: I just added the re.UNICODE flag and created a UDF.
import re
from pyspark.sql.functions import udf

regexp = re.compile(r'^\W+|\W+$', flags=re.UNICODE)

def remove_non_utf8(string):
    return regexp.sub('', string)
replace_utf8 = udf(remove_non_utf8)
This removes all non-word characters from the beginning or end; I used this url as a reference.
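Here is a minimal, self-contained sketch of applying the UDF to a DataFrame; the sample rows and the "text" column name are assumptions taken from the question:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, udf
import re

spark = SparkSession.builder.getOrCreate()

# sample rows and the "text" column name are assumptions based on the question
df = spark.createDataFrame(
    [(".. -- Água, 1234 ...",),
     ("- -- https://www.example.com/page.html *****",)],
    ["text"],
)

regexp = re.compile(r'^\W+|\W+$', flags=re.UNICODE)
replace_utf8 = udf(lambda s: regexp.sub('', s) if s is not None else None)

df.withColumn("clean", replace_utf8(col("text"))).show(truncate=False)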
--EDIT--
I tried using:
(?ui)^\W+|\W+$
with PySpark's regexp_replace function, but it didn't work, so I'm sticking with the re solution.
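For reference, this is roughly what that attempt looked like (a sketch only; df and the "text" column name are placeholders):

from pyspark.sql.functions import regexp_replace

# the regexp_replace attempt described above; "text" is a placeholder column name
df = df.withColumn("clean", regexp_replace("text", r"(?ui)^\W+|\W+$", ""))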
Upvotes: 0
Reputation: 27723
Maybe it would help if we looked into the data you have and then wrote an expression similar to:
(?i)\S[a-z].+[a-z0-9]
or,
(?i)\S*[a-z].+[a-z0-9]
If you wish to simplify/modify/explore the expression, it's explained in the top right panel of regex101.com. If you'd like, you can also watch in this link how it would match against some sample inputs.
import re
regex = r"(?i)\S[a-z].+[a-z0-9]"
string = """
.. -- Água, 1234 ...
- -- https://www.example.com/page.html *****
"""
print(re.findall(regex, string))
Output:
['Água, 1234', 'https://www.example.com/page.html']
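If you need this back in PySpark, the same pattern might work directly with regexp_extract, since (?i), \S, and the character classes used here are also valid in Java's regex engine; a sketch, assuming a DataFrame df with a "text" column:

from pyspark.sql.functions import regexp_extract

# group 0 returns the whole match; "text" is an assumed column name
df = df.withColumn("clean", regexp_extract("text", r"(?i)\S[a-z].+[a-z0-9]", 0))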
Upvotes: 1