Is there a way to remove only BAD characters from a string in Python/pandas?

I am trying to read a PDF using the Camelot library and store the result in a dataframe. The resulting dataframe has garbled/bad characters in its string fields.

E.g.: 123Rise â€“ Tower & Troe's Mechâ€“

I want to remove ONLY the garbled characters and keep everything else, including symbols.

I tried regexes such as [^\w.,&,'-\s] to keep only the desirable characters, but then I have to list every special character that should not be removed. I also cannot drop the Camelot library.

Is there a way to solve this?

Upvotes: 2

Answers (4)

Prashant Maurya

Reputation: 678

Removing non-ASCII characters using regex will be fast:

import re

text = "123Rise â€“ Tower & Troe's Mechâ€“"
# Replace every run of characters outside the ASCII range (0x00-0x7F) with nothing.
re.sub(r'[^\x00-\x7F]+', '', text)

The output will be:

"123Rise  Tower & Troe's Mech"
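Since the question is about a whole table rather than one string, here is a minimal sketch of the same pattern wrapped in a helper that can be applied cell by cell. The names `NON_ASCII`, `strip_non_ascii` and `rows` are made up for illustration, and the list stands in for a dataframe column; with pandas, the same pattern could probably be passed to `Series.str.replace(..., regex=True)` instead.

```python
import re

# Matches one or more consecutive characters outside the 7-bit ASCII range.
NON_ASCII = re.compile(r'[^\x00-\x7F]+')

def strip_non_ascii(value):
    """Drop non-ASCII runs from strings; return any other value unchanged."""
    if not isinstance(value, str):
        return value
    return NON_ASCII.sub('', value)

# Hypothetical cells standing in for one column of the extracted table.
rows = ["123Rise â€“ Tower & Troe's Mechâ€“", "Plain & simple", None, 42]
cleaned = [strip_non_ascii(v) for v in rows]
print(cleaned)
```

Non-string cells (missing values, numbers) pass through untouched, so the helper should be safe to map over mixed columns. Note that it removes every non-ASCII character, including legitimate ones such as accented letters.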

Upvotes: 1

Jack Hales

Reputation: 1644

Another approach I commonly use to filter out non-ASCII garbage, which may or may not be relevant here:

# Your "messy" data in question.
string = "123Rise â€“ Tower & Troe's Mechâ€“"

# Keep only the characters whose code point is below 128 (the ASCII range).
clean = "".join(c for c in string if ord(c) < 128)

What is ord? It returns the Unicode code point of a character as an integer (for ASCII characters this is the same as the ASCII code). You can use this to your advantage by keeping only characters with a value below 128 (as above), which limits your text to basic ASCII and drops everything else without having to work with messy encodings.
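To make the cutoff concrete, here is a small sketch that prints a couple of code points and lists which characters the filter would drop. The variable names `dropped` and `same` are just for illustration.

```python
text = "123Rise â€“ Tower & Troe's Mechâ€“"

print(ord("&"))  # 38, inside the ASCII range, so it is kept
print(ord("€"))  # 8364, outside the ASCII range, so it is dropped

# Collect the distinct characters the ord(c) < 128 filter would drop, with their code points.
dropped = sorted({(c, ord(c)) for c in text if ord(c) >= 128})
print(dropped)

clean = "".join(c for c in text if ord(c) < 128)
print(clean)

# On Python 3.7+, str.isascii() makes the same decision for a single character.
same = all(c.isascii() == (ord(c) < 128) for c in text)
print(same)
```

Symbols such as & and ' survive because they are plain ASCII; only characters with code points of 128 and above are removed.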

Hope that helps!

Upvotes: 1

Sharmiko

Reputation: 623

One way to achieve that is to remove non-ASCII characters.

my_text = "123Rise â€“ Tower & Troe's Mechâ€“"
my_text = ''.join(char for char in my_text if ord(char) < 128)
print(my_text)

Result:

123Rise  Tower & Troe's Mech

You can also use this website as a reference for standard and extended ASCII characters.

Upvotes: 1

Cow

Reputation: 3040

You could try using the unicodedata library to normalize your data, for example:

import unicodedata

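def formatString(value, allow_unicode=False):
    # Coerce non-string input (numbers, None, etc.) to str first.
    value = str(value)
    if allow_unicode:
        # Keep Unicode, but fold compatibility characters into their composed forms.
        value = unicodedata.normalize('NFKC', value)
    else:
        # Decompose accented letters, then drop everything that is not ASCII.
        value = unicodedata.normalize('NFKD', value).encode('ascii', 'ignore').decode('ascii')
    return value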

print(formatString("123Rise â€“ Tower & Troe's Mechâ€“"))

Result:

123Rise a Tower & Troe's Mecha
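Note that the result keeps an "a" where each "â" was. A short sketch of why, using only the standard library: NFKD splits "â" into a base letter plus a combining accent, and the ASCII encode step then drops only the accent. The variable names here are illustrative.

```python
import unicodedata

# "\u00e2" is the precomposed character "â".
decomposed = unicodedata.normalize("NFKD", "\u00e2")
names = [unicodedata.name(c) for c in decomposed]
print(names)  # ['LATIN SMALL LETTER A', 'COMBINING CIRCUMFLEX ACCENT']

# Encoding with errors="ignore" drops the combining accent but keeps the base letter.
ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
print(ascii_only)  # a

# "\u20ac" (the euro sign) has no decomposition into ASCII, so it is dropped entirely.
euro = unicodedata.normalize("NFKD", "\u20ac").encode("ascii", "ignore").decode("ascii")
print(repr(euro))  # ''
```

This behaviour is useful when accented letters should be reduced to their base letters (for example "é" to "e"), but for garbled text like the example it leaves stray letters behind, so a plain code point filter may be the better fit.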

Upvotes: 2
