Reputation: 11

Remove ascii characters in a string python

I want to delete special characters from a string, but I have not been successful. Can you help me?

The value shows two backslashes ("\\") before each escape, but when I print it there is only one ("\"). Why is that?

Data Update:

data = [{
            "data": "0\\x1e\\x82*.extractdomain.com\\x82\\x0ctest.extractdomain.com",
            "name": "subjectAltName"
        }]

import re

re.sub("[^\x20-\x7E]", "", data[0]["data"])

Upvotes: 1

Answers (5)

user13824946

Reputation:

Try this approach

import re


def delete_punc(s):
    # Split on whitespace so each word is cleaned separately.
    words = s.split()

    # Keep only the ASCII letters of each word, whatever the number of words.
    cleaned = [''.join(re.findall(r'[a-zA-Z]', word)) for word in words]

    return ' '.join(cleaned)

# Raw string so that \d stays a literal backslash followed by 'd'.
print(delete_punc(r"He3l?/l!o W{o'r[l9\d)"))

output

Hello World

Upvotes: 0

snakecharmerb

Reputation: 55600

It looks as if the string contains \x escapes which have themselves been escaped, leading to the doubled backslashes. Perhaps you received the data like this, or perhaps some earlier processing has corrupted the data. The doubled backslashes can be removed by encoding the string as bytes and then decoding with the unicode-escape codec. After this, your regex will work.

>>> import re
>>> s = "0\\x1e\\x82*.extractdomain.com\\x82\\x0ctest.extractdomain.com"
>>> fixed = s.encode('latin-1').decode('unicode-escape')
>>> fixed
'0\x1e\x82*.extractdomain.com\x82\x0ctest.extractdomain.com'
>>> re.sub("[^\x20-\x7E]", "", fixed)
'0*.extractdomain.comtest.extractdomain.com'
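
The two steps above could be wrapped in a small helper. This is only a sketch: the name strip_nonprintable is made up for illustration, and it assumes the escaped input contains nothing outside Latin-1 before it is unescaped.

```python
import re


def strip_nonprintable(escaped):
    # Hypothetical helper: turn literal backslash-x sequences into the
    # characters they stand for, then drop everything outside printable ASCII.
    unescaped = escaped.encode('latin-1').decode('unicode-escape')
    return re.sub(r'[^\x20-\x7E]', '', unescaped)


s = "0\\x1e\\x82*.extractdomain.com\\x82\\x0ctest.extractdomain.com"
print(repr(s))                 # repr shows every backslash doubled
print(s)                       # print shows the single backslash actually stored
print(strip_nonprintable(s))
```

This also covers the question about the doubled backslashes: the string holds one backslash per sequence, and only its repr displays two.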

Upvotes: 0

AM Z

Reputation: 431

Try this.

clean_text = ' '.join(re.findall(r"[^\W]+", text))

EDIT: or this.

custom_translation = {130: None, 22: None}
print(text.translate(custom_translation))

The question has since been edited (the text changed), so this solution no longer works. The old text was:

text = '0:\x82 test test test\x82\x16testtesttest'

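Newer solution (note that this deletes every occurrence of the listed characters, so the digits 1, 2, 6, 8, the letter x, and the backslash are also removed from ordinary text):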

custom_translation = {22: None, 49: None, 50: None, 54: None, 56: None, 92: None, 120: None}
print(text.translate(custom_translation))
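
Rather than listing code points by hand, the translation table could be built from the text itself. A minimal sketch, assuming the string holds real control characters (as in the old text) and not escaped backslash sequences; make_table is a hypothetical name.

```python
def make_table(text):
    # Map every character outside printable ASCII (0x20-0x7E) to None,
    # which tells str.translate to delete it.
    return {ord(ch): None for ch in set(text) if not 0x20 <= ord(ch) <= 0x7E}


text = '0:\x82 test test test\x82\x16testtesttest'
print(text.translate(make_table(text)))
```

Because only characters outside the printable range end up in the table, ordinary digits and letters are left alone.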

Upvotes: 2

rootkonda

Reputation: 1743

import re

txt = "0:\\x82 test test test\\x82\\x16testtesttest"
x = re.sub("\\\\(?:x16|x82)", "", txt)

To generalize to any such escape sequence (a backslash, an x, and two hexadecimal digits):

x = re.sub("\\\\x[0-9a-fA-F]{2}", "", txt)

Output:

0: test test testtesttesttest

Good to know:

In short, to match a literal backslash, one has to write '\\\\' as the RE string, because the regular expression must be \\, and each backslash must be expressed as \\ inside a regular Python string literal. In REs that feature backslashes repeatedly, this leads to lots of repeated backslashes and makes the resulting strings difficult to understand.

Another way is to use Python's raw string notation for regular expressions; backslashes are not handled in any special way in a string literal prefixed with 'r', so r"\n" is a two-character string containing '\' and 'n', while "\n" is a one-character string containing a newline. Regular expressions will often be written in Python code using this raw string notation.
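
A short illustration of the difference, using a shortened sample string chosen here for the example:

```python
import re

# A raw string keeps the backslash; a regular string turns \n into a newline.
raw_len = len(r"\n")
plain_len = len("\n")
print(raw_len, plain_len)

# Both patterns below match one literal backslash followed by "x82".
txt = "0:\\x82 test"
with_plain = re.sub("\\\\x82", "", txt)   # regular string: four backslashes
with_raw = re.sub(r"\\x82", "", txt)      # raw string: two backslashes
print(with_plain)
print(with_raw)
```

Both calls give the same result; the raw form is simply easier to read.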

For more examples, see "The Backslash Plague" in the Python Regular Expression HOWTO.

Upvotes: 1

Cyril Jouve

Reputation: 1040

The error is in the declaration of text: you double-escape the backslash, so you are writing a literal \ followed by x82 instead of a hexadecimal character escape.

import re

text = '0:\x82 test test test\x82\x16testtesttest'

print(re.sub("[^\x20-\x7E]", "", text))

Prints: 0: test test testtesttesttest
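
To see why the regex appears to do nothing on the double-escaped version, the two declarations can be compared side by side. This is a sketch with a shortened sample string chosen for illustration.

```python
import re

pattern = r"[^\x20-\x7E]"

real = '0:\x82 test'       # \x82 is a single non-printable character
escaped = '0:\\x82 test'   # backslash, 'x', '8', '2': four printable characters

print(len(real), len(escaped))
print(re.sub(pattern, "", real))
# Every character of the escaped version is printable ASCII, so nothing is removed.
print(re.sub(pattern, "", escaped) == escaped)
```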

Upvotes: 0
