Reputation: 4244

How to keep only alphanumeric and space, and also ignore non-ASCII?

I have this line, which is meant to remove all non-alphanumeric characters except spaces:

re.sub(r'\W+', '', s)

However, it still keeps non-English characters.

For example if I have

re.sub(r'\W+', '', 'This is a sentence, and here are non-english 托利 苏 !!11')

I want to get as output:

> 'This is a sentence and here are non-english  11'

Upvotes: 23

Answers (3)

Tilman Böckenförde

Reputation: 123

This might not be an answer to this specific question, but I came across this thread during my research.

I wanted to achieve the same goal as the questioner, but I also wanted to keep non-English characters such as ä, ü, ß, ...

The way the questioner's code works, spaces are deleted too, because \W matches whitespace as well.

A simple workaround is the following:

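re.sub(r'[^ \w]+', '', string)

The ^ at the start of the character class negates it, so the pattern matches every character except the ones listed: the space and \w, i.e. every word character (letters including non-English ones, digits, and the underscore). The + belongs outside the brackets; inside them it would be a literal plus sign and would be kept in the output.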
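As a minimal sketch of the difference between Unicode-aware and ASCII-only matching of \w in Python 3: the function name and the ascii_only switch below are hypothetical, added only for illustration. Passing re.ASCII restricts \w to [a-zA-Z0-9_], which is one way to drop non-English letters while keeping the same pattern.

```python
import re

def keep_word_chars_and_spaces(text, ascii_only=False):
    """Remove everything except word characters (\\w) and spaces.

    ascii_only is a hypothetical switch for this sketch: when True,
    re.ASCII limits \\w to [a-zA-Z0-9_].
    """
    flags = re.ASCII if ascii_only else 0
    # The quantifier sits outside the brackets, so '+' itself is removed too.
    return re.sub(r'[^ \w]+', '', text, flags=flags)

print(keep_word_chars_and_spaces('Grüße, Welt! 托利_1'))
# Grüße Welt 托利_1
print(keep_word_chars_and_spaces('Grüße, Welt! 托利_1', ascii_only=True))
# Gre Welt _1
```

Note that \w keeps the underscore in both modes, so it is not a pure "alphanumeric" class.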

I hope this will help someone in the future

Upvotes: 10

Malekai

Reputation: 5011

I once had this exact problem; the only difference was that I wasn't able to import anything or use regex.

To solve my problem I created a list containing all of the values I wanted to keep:

values = list("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789 ")

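Then I created a function that loops through each character in the string and, if it isn't in the values list, removes it by replacing it with an empty string:

def remover(my_string=""):
    # Iterating over the original string is safe: str.replace returns a
    # new string, so rebinding my_string does not affect the loop.
    for item in my_string:
        if item not in values:
            # Removes every occurrence of this character at once.
            my_string = my_string.replace(item, "")
    return my_string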

For example, the following code:

print(remover("H!e£l$l%o^ W&o*r(l)d!:)"))

Should output:

Hello World

Sure, this isn't the best way to do it, but given the circumstances it was a quick and easy way to get the job done.

NOTE: to do the opposite, i.e. remove the characters that are in the values list and keep everything else, change "if item not in values" to "if item in values".

NOTE: I wasn't allowed to use string constants because the string package has to be imported to use them.
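For comparison, here is a sketch of a single-pass variant that also needs no imports. It is not the code from this answer; the names ALLOWED and keep_allowed are made up for illustration. It builds the result with str.join instead of calling replace once per unwanted character.

```python
# Characters to keep; a set gives fast membership checks.
ALLOWED = set("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789 ")

def keep_allowed(text):
    # Keep only the characters found in ALLOWED, in their original order.
    return "".join(ch for ch in text if ch in ALLOWED)

print(keep_allowed("H!e£l$l%o^ W&o*r(l)d!:)"))
# Hello World
```

Because any character outside the set is dropped, non-ASCII letters such as 托 or ü are removed as well.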

Good luck.

Upvotes: 3

Nir Levy

Reputation: 4740

re.sub(r'[^A-Za-z0-9 ]+', '', s)

(Edit) To clarify: The [] create a set of characters. The ^ negates the set. A-Za-z is the English alphabet, 0-9 is the digits, and the blank after the 9 is a space. For any run of one or more characters outside this set (that is, anything that is not A-Z, a-z, 0-9, or a space), replace it with the empty string.
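Applied to the string from the question, this behaves roughly as sketched below. The function name and the extra parameter are hypothetical additions for illustration. One thing to be aware of: the hyphen in "non-english" is not in the set, so it is removed unless it is added explicitly.

```python
import re

def keep_ascii_alnum_and_spaces(text, extra=''):
    # extra is a hypothetical parameter: additional literal characters to keep.
    # re.escape makes characters like '-' safe inside the character class.
    pattern = '[^A-Za-z0-9 ' + re.escape(extra) + ']+'
    return re.sub(pattern, '', text)

s = 'This is a sentence, and here are non-english 托利 苏 !!11'

print(repr(keep_ascii_alnum_and_spaces(s)))
# 'This is a sentence and here are nonenglish   11'
print(repr(keep_ascii_alnum_and_spaces(s, extra='-')))
# 'This is a sentence and here are non-english   11'
```

The spaces that surrounded the removed characters are kept, which is why several spaces remain before 11. If single spaces are wanted, ' '.join(result.split()) collapses them.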

Upvotes: 52
