Anudocs

Reputation: 686

How to remove non-printable characters from string?

I am reading a Word file using the code below:

import win32com.client as win32

word = win32.dynamic.Dispatch("Word.Application")
word.Visible = 0
doc = word.Documents.Open(SigLexiconFilePath)

The strings I get from the file contain a lot of non-printable characters:

str = "\xa0keine\xa0freigäbü\xa0\x0b\r\x07"

I tried the code below to remove the non-printable characters:

import string 

str = "\xa0keine\xa0freigäbü\xa0\x0b\r\x07"
filtered_string = "".join(filter(lambda x:x in string.printable, str))

This gives me the following output:

keinefreigb\x0b\r

Another piece of code I tried:

str = str.split('\r')[0]
str = str.strip()

This gives me the following output:

keine\xa0freigäbü

How can I remove all of these non-printable characters and get the desired output below, using minimal code?

keine freigäbü

Upvotes: 0

Views: 3682

Answers (3)

Thomas Juul Dyhr

Reputation: 101

An elegant, Pythonic way to strip non-printable characters from a string is to use the str.isprintable() method together with a generator expression or a list comprehension, depending on the use case (i.e. the size of the string):

''.join(c for c in str if c.isprintable())

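This returns 'keinefreigäbü'. Note that the non-breaking spaces (\xa0) are dropped as well, so the two words end up joined.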

From the Python documentation for str.isprintable(): Return True if all characters in the string are printable or the string is empty, False otherwise. Nonprintable characters are those characters defined in the Unicode character database as “Other” or “Separator”, excepting the ASCII space (0x20) which is considered printable. (Note that printable characters in this context are those which should not be escaped when repr() is invoked on a string. It has no bearing on the handling of strings written to sys.stdout or sys.stderr.)
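Since isprintable() classifies the non-breaking space as a separator, filtering on it alone cannot produce 'keine freigäbü'. Here is a minimal sketch of one way around that, assuming it is acceptable to turn every whitespace character into a plain space before filtering. The function name clean_text is hypothetical and not part of the original answer.

```python
def clean_text(text: str) -> str:
    """Turn whitespace into spaces, drop other non-printables, collapse runs."""
    # str.isspace() is True for "\xa0", "\x0b" and "\r", so they become " ".
    spaced = "".join(" " if ch.isspace() else ch for ch in text)
    # Anything still non-printable (e.g. the BEL character "\x07") is dropped.
    printable = "".join(ch for ch in spaced if ch.isprintable())
    # split() with no argument collapses runs of spaces and trims both ends.
    return " ".join(printable.split())


raw = "\xa0keine\xa0freigäbü\xa0\x0b\r\x07"
print(clean_text(raw))
```

Converting whitespace first keeps the word boundary that the plain isprintable() filter loses.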

Upvotes: 3

Franco Gil

Reputation: 421

Try these lines.

import re

def convert_tiny_str(x: str) -> str:
    r""" Taking this into consideration:

    > https://www.ascii-code.com/

    Quoting: "The first 32 characters in the ASCII-table are unprintable control
    codes and are used to control peripherals such as printers."
    These are hex codes 00 to 1F, [00, 1F].

    In extended ASCII, the printable characters are listed
    from \x20 to \xFF in hexadecimal, [20, FF].

    So a possible solution using regular expressions is:

    1- Replace every character outside the range [20, FF] with ''.

    2- At that point the character \xa0 is still part of the result.
    Replace it with ' '.
    """

    # Use the parameter x, not the module-level variable _str.
    _out = re.sub(r'[^\x20-\xff]', '', x)
    # >> '\xa0keine\xa0freigäbü\xa0'

    return re.sub(r'\xa0', ' ', _out)


_str = "\xa0keine\xa0freigäbü\xa0\x0b\r\x07"
x = convert_tiny_str(_str)

print(x)
# >>' keine freigäbü '

Done.
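One caveat with the range [\x20-\xff]: it still lets through DEL and the C1 control codes (\x7f to \x9f), and it deletes every character above U+00FF, such as '€'. As an alternative, here is a minimal sketch that decides by Unicode category using the standard unicodedata module. The function name strip_control_chars is hypothetical, and dropping control characters rather than replacing them is an assumption about what is wanted.

```python
import unicodedata


def strip_control_chars(text: str) -> str:
    """Drop Unicode 'Other' (C*) characters and turn separators (Z*) into spaces."""
    out = []
    for ch in text:
        category = unicodedata.category(ch)
        if category.startswith("C"):
            # Control, format, unassigned, etc.: "\x0b", "\r", "\x07", "\x85" ...
            continue
        # Separators such as "\xa0" (category Zs) become a plain space.
        out.append(" " if category.startswith("Z") else ch)
    return "".join(out).strip()


print(strip_control_chars("\xa0keine\xa0freigäbü\xa0\x0b\r\x07"))
```

Because '\r' and '\x0b' are in category Cc, they are removed here rather than turned into spaces, so text such as 'a\rb' would come out as 'ab'.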

Upvotes: 1

Alper

Reputation: 327

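Most of these characters seem to be whitespace characters (\x07 is the BEL control character). You can try Python's unicodedata module to convert some of them to proper whitespace characters consistently. NFKC is used here so that ä and ü stay single precomposed characters; NFKD would split each into a base letter plus a combining mark:

>>> import unicodedata
>>> unicodedata.normalize("NFKC","\xa0keine\xa0freigäbü\xa0\x0b\r\x07")
' keine freigäbü \x0b\r\x07'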

You can then apply a series of replacements followed by strip() to get what you want, provided the set of characters you are trying to remove is small.

>>> ' keine freigäbü \x0b\r\x07'.replace("\x0b"," ").replace("\r"," ").\
        replace("\x07"," ").strip()
'keine freigäbü'
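If the list of characters grows, the chained replace() calls can be folded into a single translation table. This is a minimal sketch assuming the same four characters as above. The names CLEANUP_TABLE and clean_with_table are hypothetical.

```python
# Map each unwanted character to a space, or to None to delete it outright.
CLEANUP_TABLE = str.maketrans({
    "\xa0": " ",   # non-breaking space
    "\x0b": " ",   # vertical tab
    "\r": " ",     # carriage return
    "\x07": None,  # BEL control character
})


def clean_with_table(text: str) -> str:
    """Apply the table in one pass, then trim leading/trailing whitespace."""
    return text.translate(CLEANUP_TABLE).strip()


print(clean_with_table("\xa0keine\xa0freigäbü\xa0\x0b\r\x07"))
```

Since the table already maps \xa0 to a space, the normalization step is not needed for this particular input.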

Hope this helps.

Upvotes: 1
