TheBeginner

Reputation: 415

How can I recognise non-printable Unicode characters in Python?

I am trying to generate a Unicode string from random characters, and I don't want any non-printable characters in it. I use 'unichr(codepoint)' to convert each code point to a Unicode character and 'unicode.encode('utf-8')' to convert the Unicode string to a byte string. I tried using string.printable, but that covers only ASCII.
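To make the limitation concrete, here is a minimal sketch of the behaviour described above. It is written for Python 3, where chr() plays the role of Python 2's unichr(); the variable names are only illustrative.

```python
import string

# Every character in string.printable is ASCII (code point below 128).
ascii_only = all(ord(ch) < 128 for ch in string.printable)

# U+00E9 (LATIN SMALL LETTER E WITH ACUTE) is printable but not ASCII,
# so a membership test against string.printable rejects it.
e_acute = chr(0xE9)
in_printable = e_acute in string.printable

# Encoding the character to UTF-8 yields a two-byte sequence.
encoded = e_acute.encode('utf-8')

print(ascii_only, in_printable, encoded)
```

So a check based on string.printable would discard every non-ASCII character, printable or not.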

Upvotes: 1

Views: 3193

Answers (1)

Markus Jarderot

Reputation: 89171

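You could use the unicodedata module from the standard library and keep only the characters whose Unicode general category is one you consider printable.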

import unicodedata

def strip_string(string):
  """Cleans a string based on a whitelist of printable unicode categories
  You can find a full list of categories here:
  http://www.fileformat.info/info/unicode/category/index.htm
  """
  letters     = ('Ll', 'Lm', 'Lo', 'Lt', 'Lu')
  numbers     = ('Nd', 'Nl', 'No')
  marks       = ('Mc', 'Me', 'Mn')
  punctuation = ('Pc', 'Pd', 'Pe', 'Pf', 'Pi', 'Po', 'Ps')
  symbol      = ('Sc', 'Sk', 'Sm', 'So')
  space       = ('Zs',)

  allowed_categories = letters + numbers + marks + punctuation + symbol + space

  return u''.join([ c for c in string if unicodedata.category(c) in allowed_categories ])

Source: https://gist.github.com/Jonty/6705090
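The same category check can also be applied while generating the string, so that non-printable characters are never added in the first place. The following is only a sketch under some assumptions: it targets Python 3 (chr() instead of unichr()), the names is_printable and random_printable_string are hypothetical, and the 0x20 to 0xFFFF code point range is an arbitrary choice for illustration.

```python
import random
import unicodedata

# Major categories accepted here: Letter, Mark, Number, Punctuation, Symbol.
# Together with 'Zs' (space separator) this matches the whitelist above.
PRINTABLE_MAJOR = {'L', 'M', 'N', 'P', 'S'}

def is_printable(ch):
    """Return True if ch falls in one of the whitelisted categories."""
    cat = unicodedata.category(ch)
    return cat[0] in PRINTABLE_MAJOR or cat == 'Zs'

def random_printable_string(length, rng=random, max_codepoint=0xFFFF):
    """Build a string of `length` random characters, skipping rejected ones.

    Control characters (Cc), surrogates (Cs), and unassigned code points (Cn)
    are not in the whitelist, so they are never added.
    """
    chars = []
    while len(chars) < length:
        ch = chr(rng.randint(0x20, max_codepoint))
        if is_printable(ch):
            chars.append(ch)
    return ''.join(chars)

if __name__ == '__main__':
    s = random_printable_string(20)
    print(s.encode('utf-8'))
```

Because surrogate code points are filtered out, the result can always be encoded to UTF-8. Note that "printable" here only means the character belongs to a whitelisted category; whether it renders visibly still depends on the font in use.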

Upvotes: 1
