Reputation: 686
I am reading a Word file using the code below:
import win32com.client as win32
word = win32.dynamic.Dispatch("Word.Application")
word.Visible = 0
doc = word.Documents.Open(SigLexiconFilePath)
The strings I get from the file contain a lot of non-printable characters:
str = "\xa0keine\xa0freigäbü\xa0\x0b\r\x07"
I tried the code below to remove the non-printable characters:
import string
str = "\xa0keine\xa0freigäbü\xa0\x0b\r\x07"
filtered_string = "".join(filter(lambda x:x in string.printable, str))
This gives me the following output:
keinefreigb\x0b\r
Another piece of code I tried:
str = str.split('\r')[0]
str = str.strip()
This gives me the following output:
keine\xa0freigäbü
How can I remove all of these non-printable characters, with minimal code, to get the desired output below?
keine freigäbü
Upvotes: 0
Views: 3682
Reputation: 101
An elegant, Pythonic way to strip non-printable characters from a string is to use the str.isprintable() method together with a generator expression or a list comprehension, depending on the use case (i.e. the size of the string):
''.join(c for c in str if c.isprintable())
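This returns 'keinefreigäbü'. Note that the non-breaking spaces (\xa0) are removed as well, because they are not considered printable.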
From the Python documentation for str.isprintable(): Return True if all characters in the string are printable or the string is empty, False otherwise. Nonprintable characters are those characters defined in the Unicode character database as “Other” or “Separator”, excepting the ASCII space (0x20) which is considered printable. (Note that printable characters in this context are those which should not be escaped when repr() is invoked on a string. It has no bearing on the handling of strings written to sys.stdout or sys.stderr.)
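Since isprintable() also drops the non-breaking spaces, the result loses the word gap the question asks for. A minimal sketch of one way around that, assuming every whitespace character should become an ordinary space before filtering (the function name strip_nonprintable is made up for illustration):

```python
def strip_nonprintable(text: str) -> str:
    # Hypothetical helper, not part of any library.
    # Turn every whitespace character (\xa0, \x0b, \r, ...) into a plain space.
    spaced = "".join(" " if c.isspace() else c for c in text)
    # Drop whatever is still non-printable (for example \x07).
    printable = "".join(c for c in spaced if c.isprintable())
    # Collapse runs of spaces and trim both ends.
    return " ".join(printable.split())


raw = "\xa0keine\xa0freigäbü\xa0\x0b\r\x07"
print(strip_nonprintable(raw))  # keine freigäbü
```

The whitespace pass has to come first; otherwise \xa0 is filtered out before it can be turned into a space.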
Upvotes: 3
Reputation: 421
Try these lines.
import re

def convert_tiny_str(x: str) -> str:
    r"""Strip control characters and turn non-breaking spaces into spaces.

    Taking this into consideration:
    > https://www.ascii-code.com/
    Citing: "The first 32 characters in the ASCII-table are unprintable control
    codes and are used to control peripherals such as printers."
    That is, from hex code 00 to hex code 1F, [00, 1F].
    In extended ASCII, the printable characters are listed
    from \x20 to \xFF in hexadecimal, [20, FF].
    So the regular expression I can offer as a possible
    solution does this:
    1- Replace all characters except the printable ones with ''.
    2- The character \xa0 is still part of the resulting string,
       so replace it with ' '.
    """
    # Step 1: drop everything outside the range \x20-\xff.
    _out = re.sub(r'[^\x20-\xff]', r'', x)
    # >> '\xa0keine\xa0freigäbü\xa0'
    # Step 2: turn the remaining non-breaking spaces into ordinary spaces.
    return re.sub(r'\xa0', r' ', _out)

_str = "\xa0keine\xa0freigäbü\xa0\x0b\r\x07"
x = convert_tiny_str(_str)
print(x)
# >>' keine freigäbü '
Done.
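One caveat with the regular expression above: [^\x20-\xff] also deletes every character beyond U+00FF (for example ł or €), and the result still has a leading and trailing space. A hedged variant, assuming only the C0/C1 control ranges should be removed and that surrounding spaces are unwanted (clean_with_regex is a hypothetical name):

```python
import re

# C0 controls, DEL, and C1 controls; everything else is left alone.
_CONTROL_RE = re.compile(r"[\x00-\x1f\x7f-\x9f]")
# Runs of ordinary and non-breaking spaces.
_SPACE_RUN_RE = re.compile(r"[ \xa0]+")


def clean_with_regex(text: str) -> str:
    # Hypothetical helper: remove control characters, then normalize spaces.
    # Note: this also removes \t and \n, as the original regex does.
    no_controls = _CONTROL_RE.sub("", text)
    return _SPACE_RUN_RE.sub(" ", no_controls).strip()


print(clean_with_regex("\xa0keine\xa0freigäbü\xa0\x0b\r\x07"))  # keine freigäbü
```

If tabs and newlines separate words in your data, substituting a space instead of an empty string in the first step may be the safer choice.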
Upvotes: 1
Reputation: 327
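Most of these characters are whitespace characters (\x07, the bell character, is a control character). You can use Python's unicodedata module to convert some of them to ordinary spaces consistently. NFKC is used here rather than NFKD so that ä and ü stay single precomposed characters instead of being split into a base letter plus a combining mark:
>>> import unicodedata
>>> unicodedata.normalize("NFKC","\xa0keine\xa0freigäbü\xa0\x0b\r\x07")
' keine freigäbü \x0b\r\x07'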
If the set of characters you want to remove is small, you can then chain a few replace() calls and a strip() to get what you want:
>>> ' keine freigäbü \x0b\r\x07'.replace("\x0b"," ").replace("\r"," ").\
replace("\x07"," ").strip()
'keine freigäbü'
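When the set of unwanted characters is not known in advance, the chain of replace() calls can be generalized by looking at each character's Unicode category. This is only a sketch, assuming that all whitespace should become a single space and that every character in an "Other" (C*) category should be dropped; clean_by_category is a made-up name:

```python
import unicodedata


def clean_by_category(text: str) -> str:
    # Hypothetical helper built on the standard unicodedata module.
    # NFKC maps compatibility characters such as \xa0 to their plain forms.
    normalized = unicodedata.normalize("NFKC", text)
    kept = []
    for ch in normalized:
        if ch.isspace():
            # \x0b, \r, \n, \t and friends become an ordinary space.
            kept.append(" ")
        elif unicodedata.category(ch).startswith("C"):
            # Control, format, surrogate, private-use, unassigned: drop.
            continue
        else:
            kept.append(ch)
    # Collapse runs of spaces and trim both ends.
    return " ".join("".join(kept).split())


print(clean_by_category("\xa0keine\xa0freigäbü\xa0\x0b\r\x07"))  # keine freigäbü
```

Dropping the whole C* group also removes format characters such as zero-width joiners, which may or may not be what you want for scripts that rely on them.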
Hope this helps.
Upvotes: 1