Reputation: 14721
I have this code and I want to remove the non-alphanumeric characters. The problem is that it removes the Arabic words as well. How can I keep the Arabic characters and remove only the non-alphanumeric ones?
# -*- coding: utf-8 -*-
import re
hello = u"سلام .@#(*&"
print re.sub(r'\W+', '', hello)
It outputs an empty string.
But I want this:
"سلام"
Upvotes: 1
Views: 2991
Reputation: 802
I had the same problem until I found this jQuery solution. I wanted a slug generator that respects Arabic characters. The idea is to add the Arabic range \u0600-\u06FF to the negated character class in the regular expression, so that those characters are kept. This is the final result, hope it helps:
// slug creation
$(document).ready(function(){
    $("#name").change(function(){
        // Declare locals with var so they do not leak as implicit globals
        var postTitle = document.getElementById("name").value;
        var slugTitle = slugify(postTitle);
        document.getElementById("slug").value = slugTitle;
    });
});
function slugify(text)
{
    return text.toString().toLowerCase()
        .replace(/[^\w\u0600-\u06FF]+/g, '-') // Replace each run of characters that are neither \w nor Arabic with "-"
        .replace(/\-\-+/g, '-') // Replace multiple - with single -
        .replace(/^-+/, '') // Trim - from start of text
        .replace(/-+$/, ''); // Trim - from end of text
}
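Since the question is about Python, here is a rough port of the same idea as a minimal sketch. It assumes Python 3; the function name slugify and the sample title are only illustrative. re.ASCII is passed so that \w means [A-Za-z0-9_], as it does in the JavaScript regex, and the Arabic range is added explicitly.

```python
import re


def slugify(text):
    """Lowercase the text and join runs of ASCII word / Arabic characters with '-'."""
    # With re.ASCII, \w is [A-Za-z0-9_], mirroring JavaScript's \w.
    # Every run of characters outside \w and U+0600-U+06FF becomes one hyphen.
    slug = re.sub(r"[^\w\u0600-\u06FF]+", "-", str(text).lower(), flags=re.ASCII)
    # Trim hyphens left over at either end.
    return slug.strip("-")


# Hypothetical sample title; the escapes spell the Arabic word from the question.
title = "  Hello, World! \u0633\u0644\u0627\u0645  "
print(slugify(title))
```

Because each run of rejected characters is collapsed in one pass, the separate "replace multiple - with single -" step from the JavaScript version is not needed here.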
Upvotes: 2
Reputation: 11026
Edit: I realized there is a simpler answer. Just turn unicode mode on.
re.sub(r'\W', '', hello, flags=re.UNICODE)
In Python 3 this flag is unnecessary because of how Python 3 handles unicode strings. See https://stackoverflow.com/a/393915/691859 for more information.
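The Python 3 remark above can be checked with a short sketch. It assumes Python 3 and writes the question's string with escapes to avoid editor bidi issues. The variable names are illustrative. Passing re.ASCII brings back the ASCII-only meaning of \w, which reproduces the behaviour from the question.

```python
import re

# The string from the question: the Arabic word followed by punctuation.
hello = "\u0633\u0644\u0627\u0645 .@#(*&"

# Python 3 str patterns are Unicode-aware by default, so \W does not match Arabic letters.
unicode_result = re.sub(r"\W", "", hello)

# re.ASCII limits \w to [A-Za-z0-9_], so \W matches the Arabic letters too.
ascii_result = re.sub(r"\W", "", hello, flags=re.ASCII)

print(repr(unicode_result))
print(repr(ascii_result))
```

Note that \W also matches whitespace, so spaces between words are removed as well.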
(Old answer)
You need to define the character class that you actually want to keep. Since you're dealing with Unicode characters, you will want to construct a character class that includes your characters... I'm no Unicode expert and I also can't read Arabic, but let's go with what Wikipedia says is the Arabic Unicode block, which is U+0600 to U+06FF.
>>> re.sub(ur'[^\u0600-\u06FF]', '', hello)
u'\u0633\u0644\u0627\u0645'
The secret sauce is to make your regex itself also a unicode string, so you can put in the unicode escape sequences for the Arabic unicode block.
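For comparison, here is a minimal sketch of the same explicit-range approach in Python 3, where every str literal is already Unicode and the ur prefix no longer exists. The sample string and variable names are assumptions for illustration. Note that the U+0600 to U+06FF range also contains Arabic punctuation and digits, so this keeps more than letters.

```python
import re

# Illustrative input: the Arabic word, some ASCII word characters, and punctuation.
hello = "\u0633\u0644\u0627\u0645 abc_123 .@#(*&"

# Keep only characters from the Arabic block (U+0600 to U+06FF).
arabic_only = re.sub(r"[^\u0600-\u06FF]", "", hello)

# Keep ASCII word characters as well; re.ASCII makes \w mean [A-Za-z0-9_].
arabic_and_ascii = re.sub(r"[^\w\u0600-\u06FF]", "", hello, flags=re.ASCII)

print(repr(arabic_only))
print(repr(arabic_and_ascii))
```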
As others pointed out, \W means [^\w], which (without the Unicode flag) also matches everything in the Arabic block. If you want to remove everything but Arabic and Latin alphanumeric characters, you can use [^\w\u0600-\u06FF].
[] means character class.
^ means everything but what you're about to put in the class.
\w means A-Z, a-z, _, and 0-9.
\u0600 is the Unicode escape for the first character in the Arabic Unicode block.
- means "everything from the character before it to the character after it".
\u06FF is the Unicode escape for the last character in the Arabic Unicode block.
Upvotes: 2
Reputation: 1450
This happens because, without the re.UNICODE flag, an Arabic character is not a "word" character as far as \w is concerned...
see here
the relevant text:
"\w stands for "word character". It always matches the ASCII characters [A-Za-z0-9_]"
...
"The above three shorthands also have negated versions. \D is the same as [^\d], \W is short for [^\w] and \S is the equivalent of [^\s]."
# -*- coding: utf-8 -*-
import re
hello = u"سلام .@#(*&"
print re.sub(ur'[^\w\u0600-\u06FF]', '', hello)
Upvotes: 2