Reputation: 14721

Remove non-alphanumeric characters by regex substitution

I have this code and I want to remove the non-alphanumeric characters. The problem is that it removes the Arabic words as well. How can I keep Arabic characters and remove only the non-alphanumeric characters?

# -*- coding: utf-8 -*-
import re
hello = u"سلام .@#(*&"
print re.sub(r'\W+', '', hello)

It outputs an empty string.

But I want this:

"سلام"

Upvotes: 1

Answers (3)

Ahmed Osama

Reputation: 802

I had the same problem until I found this JavaScript solution (used with jQuery):

function slugify(text)
{
  return text.toString().toLowerCase()
    .replace(/[^\w\u0600-\u06FF]+/g, '-')  // Replace each run of characters that are neither word chars nor Arabic with "-"
    .replace(/\-\-+/g, '-')         // Replace multiple - with single -
    .replace(/^-+/, '')             // Trim - from start of text
    .replace(/-+$/, '');            // Trim - from end of text
}

I wanted to make a slug generator that respects Arabic characters. The idea is to include the Arabic range in the regular expression's character class. This is the final result, hope it helps:

// slug creation
$(document).ready(function(){
  $("#name").change(function(){
    // Read the title, slugify it, and write the result into the slug field
    var postTitle = document.getElementById("name").value;
    var slugTitle = slugify(postTitle);
    document.getElementById("slug").value = slugTitle;
  });
});


function slugify(text)
{
  return text.toString().toLowerCase()
    .replace(/[^\w\u0600-\u06FF]+/g, '-')  // Replace each run of characters that are neither word chars nor Arabic with "-"
    .replace(/\-\-+/g, '-')         // Replace multiple - with single -
    .replace(/^-+/, '')             // Trim - from start of text
    .replace(/-+$/, '');            // Trim - from end of text
}
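
Since the question is about Python, here is a rough Python 3 port of the same idea. It is only a sketch: the function name slugify is carried over from the JavaScript above, re.ASCII is used so that \w means [A-Za-z0-9_] as it does in JavaScript, and the Arabic block U+0600 to U+06FF is the same assumption as in the original regex.

```python
import re

def slugify(text):
    """Lowercase text and join runs of word/Arabic characters with '-'."""
    text = str(text).lower()
    # Replace each run of characters that are neither ASCII word chars
    # nor in the Arabic block (U+0600-U+06FF) with a single "-".
    text = re.sub(r"[^\w\u0600-\u06FF]+", "-", text, flags=re.ASCII)
    # Collapse any remaining repeated dashes.
    text = re.sub(r"-{2,}", "-", text)
    # Trim dashes from both ends.
    return text.strip("-")

print(slugify("Hello World! سلام"))
```

Note that underscores survive, because \w includes "_".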

Upvotes: 2

2rs2ts

Reputation: 11026

Edit: I realized there is a simpler answer. Just turn unicode mode on.

re.sub(r'\W', '', hello, flags=re.UNICODE)

In Python 3 this flag is unnecessary for str patterns, because they match Unicode by default. See https://stackoverflow.com/a/393915/691859 for more information.
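
To make the Python 3 behaviour concrete, here is a small sketch (Python 3 syntax, so print is a function and no u prefix is needed). The variable names unicode_result and ascii_result are only for illustration. It contrasts the default Unicode matching with re.ASCII, which restricts \w to [A-Za-z0-9_] and so reproduces the empty result from the question.

```python
import re

hello = "سلام .@#(*&"

# Default for str patterns in Python 3: \w is Unicode-aware,
# so Arabic letters are word characters and are kept.
unicode_result = re.sub(r"\W+", "", hello)

# re.ASCII limits \w to [A-Za-z0-9_], so the Arabic letters
# are treated as non-word characters and removed too.
ascii_result = re.sub(r"\W+", "", hello, flags=re.ASCII)

print(repr(unicode_result))
print(repr(ascii_result))
```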

(Old answer)

You need to define the character class that you actually want to keep. Since you're dealing with Unicode characters, you will want to construct a character class that includes your characters... I'm no Unicode expert and I also can't read Arabic, but let's go with what Wikipedia says is the Arabic Unicode block, which is U+0600 to U+06FF.

>>> re.sub(ur'[^\u0600-\u06FF]', '', hello)
u'\u0633\u0644\u0627\u0645'

The secret sauce is to make your regex itself also a unicode string, so you can put in the unicode escape sequences for the Arabic unicode block.

As others pointed out, \W means [^\w], and without the UNICODE flag \w covers only ASCII, so \W also matches everything in the Arabic block. If you want to remove everything except Arabic and Latin alphanumeric characters, you can use [^\w\u0600-\u06FF].

[] means character class.
^ means everything but what you're about to put in the class.
\w means A-Z, a-z, _, and 0-9.
\u0600 is the unicode escape for the first character in the Arabic unicode block.
- means "everything from the character on its left to the character on its right".
\u06FF is the unicode escape for the last character in the Arabic unicode block.
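
Putting those pieces together, here is a minimal Python 3 sketch of both character classes. The helper names keep_word_and_arabic and keep_arabic_only and the sample string are made up for illustration, and re.ASCII is used to mimic the ASCII-only \w described above. One caveat: the block U+0600 to U+06FF also contains Arabic punctuation and digits, so it is broader than letters only.

```python
import re

# Arabic Unicode block as a character-class range (assumption: this
# block is enough for your text; other Arabic blocks exist).
ARABIC_BLOCK = r"\u0600-\u06FF"

def keep_word_and_arabic(text):
    # Remove everything except ASCII word chars and the Arabic block.
    return re.sub(r"[^\w" + ARABIC_BLOCK + r"]+", "", text, flags=re.ASCII)

def keep_arabic_only(text):
    # Remove everything outside the Arabic block.
    return re.sub(r"[^" + ARABIC_BLOCK + r"]+", "", text)

sample = "abc سلام .@#(*& 123"
print(keep_word_and_arabic(sample))
print(keep_arabic_only(sample))
```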

Upvotes: 2

deweyredman

Reputation: 1450

This happens because, unless Unicode matching is enabled, Arabic characters are not "word" characters in the traditional sense...

see here

the relevant text:

"\w stands for "word character". It always matches the ASCII characters [A-Za-z0-9_]"

...

"The above three shorthands also have negated versions. \D is the same as [^\d], \W is short for [^\w] and \S is the equivalent of [^\s]."

# -*- coding: utf-8 -*-
import re
hello = u"سلام .@#(*&"
print re.sub(ur'[^\w\u0600-\u06FF]', '', hello)

Upvotes: 2
