Luiz Fernando Lobo

Reputation: 704

Regular expression to clean strings of words (with accents) and numbers from spaces or other characters at the beginning or end

I'm using Python with Spark to process some data containing accented words in Portuguese.

Some examples of the data look like this:

 .. -- Água, 1234 ...

 - -- https://www.example.com/page.html *****

I'm trying to remove anything that is not a word or number from the left or right of the string, getting clean results like this:

   Água, 1234
   https://www.example.com/page.html

The best I could do is this:

 ^[^\\p{N}\\p{L}]]|[^\\p{N}\\p{L}]$

But this didn't work. I saw a lot of solutions, but none that match at the beginning and end of a string containing accented characters.

Thanks in advance.

Upvotes: 1

Views: 315

Answers (2)

Luiz Fernando Lobo

Reputation: 704

I was able to do it.

Thanks to αԋɱҽԃ αмєяιcαη. It's not the best solution because it goes outside PySpark's regexp_replace function, but it works: I just added the re.UNICODE flag and created a UDF.


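import re
from pyspark.sql.functions import udf

# Matches runs of non-word characters at the start or end of the string
regexp = re.compile(r'^\W+|\W+$', flags=re.UNICODE)

def remove_non_utf8(string):
    # A single substitution handles both ends because of the alternation
    return regexp.sub('', string)

replace_utf8 = udf(remove_non_utf8)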

This removes all non-word characters (with Unicode-aware matching, so accented letters are kept) from the beginning and end. I used this URL as a reference.
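To check the behaviour outside Spark, here is a minimal plain-Python sketch of the same idea applied to the sample strings from the question. The names `EDGE_JUNK` and `strip_edges` are hypothetical, and the `None` guard is an assumption about how null column values would reach a UDF. Note that `\W` treats the underscore as a word character, so leading or trailing underscores are kept.

```python
import re

# In Python 3, \W is Unicode-aware by default for str patterns,
# so accented letters such as "Á" count as word characters.
EDGE_JUNK = re.compile(r"^\W+|\W+$")

def strip_edges(text):
    """Remove runs of non-word characters from both ends of text."""
    if text is None:  # assumption: nulls should pass through unchanged
        return None
    return EDGE_JUNK.sub("", text)

samples = [
    " .. -- Água, 1234 ...",
    " - -- https://www.example.com/page.html *****",
]
for sample in samples:
    print(repr(strip_edges(sample)))
```

Punctuation inside the string (the comma, the slashes in the URL) is untouched because the pattern is anchored to the start and end only.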

--EDIT--

I tried using:

 (?ui)^\W+|\W+$

with PySpark's regexp_replace function, but it didn't work, so I'm sticking with the re-based UDF solution.

Upvotes: 0

Emma

Reputation: 27723

Depending on what your data looks like, an expression similar to one of these might work:

(?i)\S[a-z].+[a-z0-9]

or,

(?i)\S*[a-z].+[a-z0-9]

Demo


If you wish to simplify, modify, or explore the expression, it is explained in the top-right panel of regex101.com. If you'd like, you can also see at this link how it matches against some sample inputs.


Test

import re


regex = r"(?i)\S[a-z].+[a-z0-9]"
string = """
.. -- Água, 1234 ...

 - -- https://www.example.com/page.html *****
"""

print(re.findall(regex, string))

Output

['Água, 1234', 'https://www.example.com/page.html']
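One caveat: `[a-z]` under `(?i)` covers ASCII letters only in Python 3, so a string that ends in an accented letter would be cut short. As a hedged sketch of a Unicode-aware variant (not part of the expression above), `[^\W_]` can stand in for "any letter or digit". The names `EDGE`, `pattern`, and `extract_core` are hypothetical, and the third sample is made up for illustration.

```python
import re

# [^\W_] matches any Unicode letter or digit: word characters minus the underscore.
EDGE = r"[^\W_]"
# From the first letter/digit, greedily up to the last letter/digit on the line.
pattern = re.compile(rf"{EDGE}(?:.*{EDGE})?")

def extract_core(text):
    """Return the span from the first to the last letter/digit, or '' if none."""
    match = pattern.search(text)
    return match.group(0) if match else ""

samples = [
    " .. -- Água, 1234 ...",
    " - -- https://www.example.com/page.html *****",
    ".. Ótimo é ..",  # made-up input that ends in an accented letter
]
for sample in samples:
    print(repr(extract_core(sample)))
```

Since `.` does not match a newline by default, this sketch expects one value per string rather than a multi-line block.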

Upvotes: 1
