Asked by Chris Cirefice; accepted answer by Maycow Moura.
Reputation: 5795
I've looked on Stack Overflow (replacing characters.. eh, how JavaScript doesn't follow the Unicode standard concerning RegExp, etc.) and haven't really found a concrete answer to the question "How can JavaScript match accented characters (those with diacritical marks)?"
I'm forcing a field in a UI to match the format last_name, first_name (last [comma space] first), and I want to provide support for diacritics, but evidently in JavaScript it's a bit more difficult than in other languages/platforms.
This was my original version, until I wanted to add diacritic support:
/^[a-zA-Z]+,\s[a-zA-Z]+$/
Currently I'm debating one of three methods to add support, all of which I have tested and work (at least to some extent, I don't really know what the "extent" is of the second approach). Here they are:
var accentedCharacters = "àèìòùÀÈÌÒÙáéíóúýÁÉÍÓÚÝâêîôûÂÊÎÔÛãñõÃÑÕäëïöüÿÄËÏÖÜŸçÇßØøÅåÆæœ";
// Build the full regex
var regex = "^[a-zA-Z" + accentedCharacters + "]+,\\s[a-zA-Z" + accentedCharacters + "]+$";
// Create a RegExp from the string version
var regexCompiled = new RegExp(regex);
// regexCompiled = /^[a-zA-ZàèìòùÀÈÌÒÙáéíóúýÁÉÍÓÚÝâêîôûÂÊÎÔÛãñõÃÑÕäëïöüÿÄËÏÖÜŸçÇßØøÅåÆæœ]+,\s[a-zA-ZàèìòùÀÈÌÒÙáéíóúýÁÉÍÓÚÝâêîôûÂÊÎÔÛãñõÃÑÕäëïöüÿÄËÏÖÜŸçÇßØøÅåÆæœ]+$/
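The first approach, above, matches only the letters listed in accentedCharacters.
The second approach uses the . character class, to have a simpler expression:
var regex = /^.+,\s.+$/;
This matches just about anything of the form something, something. That's alright I suppose...
The third approach uses a range of Unicode code points:
/^[a-zA-Z\u00C0-\u017F]+,\s[a-zA-Z\u00C0-\u017F]+$/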
Here are my concerns:
The first solution is far too limiting, and sloppy and convoluted at that. It would need to be changed if I forgot a character or two, and that's just not very practical.
The second solution is better and more concise, but it probably matches far more than it actually should. I couldn't find any real documentation on exactly what . matches, just the generalization of "any character except the newline character" (from a table on the MDN).
The third solution seems to be the most precise, but are there any gotchas? I'm not very familiar with Unicode, at least in practice, but looking at a code table/continuation of that table, \u00C0-\u017F seems to be pretty solid, at least for my expected input.
Which of these three approaches is most suited for the task? Or are there better solutions?
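To make the trade-offs concrete, here is a minimal sketch that runs the three patterns from the question against a few inputs. The helper name whichMatch and the sample names are invented for illustration. On the documentation question: in JavaScript, without the s flag, . matches any character except line terminators, which is why the second pattern accepts digits and punctuation.

```javascript
// The three candidate patterns from the question.
const accented = "àèìòùÀÈÌÒÙáéíóúýÁÉÍÓÚÝâêîôûÂÊÎÔÛãñõÃÑÕäëïöüÿÄËÏÖÜŸçÇßØøÅåÆæœ";
const explicitPattern = new RegExp(
  "^[a-zA-Z" + accented + "]+,\\s[a-zA-Z" + accented + "]+$"
);
const dotPattern = /^.+,\s.+$/;
const rangePattern = /^[a-zA-Z\u00C0-\u017F]+,\s[a-zA-Z\u00C0-\u017F]+$/;

// Hypothetical helper: reports which of the three patterns accept the input.
function whichMatch(input) {
  return {
    explicit: explicitPattern.test(input),
    dot: dotPattern.test(input),
    range: rangePattern.test(input),
  };
}

// Made-up sample inputs.
console.log(whichMatch("Müller, Zoë"));   // accepted by all three
console.log(whichMatch("Šimić, Željko")); // the explicit list has no Š, ć or Ž
console.log(whichMatch("123, !!!"));      // only the dot pattern accepts this
```

The second sample shows the maintenance problem with the explicit list, and the third shows how permissive the dot pattern is.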
Upvotes: 310
Views: 237429
Reputation: 3
The following regex will match every character with a diacritic in the Latin blocks, except for Latin Extended-C, D, E, F, and G, which are all medieval stuff:
/[\u00C0-\u00C5\u00C7-\u00CF\u00D1-\u00D6\u00D9-\u00DD\u00E0-\u00E5\u00E7-\u00EF\u00F1-\u00F6\u00F8-\u00FD\u00FF\u0100-\u0130\u0134-\u0137\u0139-\u0148\u014C-\u0151\u0154-\u017E\u0180-\u0183\u0187-\u0188\u018A-\u018C\u0191-\u0193\u0197-\u019B\u019D-\u01A1\u01A4-\u01A5\u01AB-\u01B0\u01B2-\u01B6\u01BA-\u01BB\u01BE\u01CD-\u01DC\u01DE-\u01F0\u01F4-\u01F5\u01F8-\u021B\u021E-\u0221\u0224-\u0236\u023A-\u0240\u0243\u0246-\u024F\u1E00-\u1E9D\u1EA0-\u1EF9\u1EFE-\u1EFF]/
Upvotes: 0
Reputation: 1594
You can remove the diacritics from letters by using:
let str = "résumé"
let result = str.normalize('NFD').replace(/\p{Diacritic}/gu, '') // returns resume
console.log(result)
This removes all the diacritical marks; you can then run your regex on the result.
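A minimal sketch of how this could be combined with the original ASCII-only pattern from the question. The function names stripDiacritics and isValidName are invented for illustration, and the engine is assumed to support Unicode property escapes (ES2018).

```javascript
// Decompose to NFD, then drop everything with the Unicode Diacritic property.
function stripDiacritics(text) {
  return text.normalize("NFD").replace(/\p{Diacritic}/gu, "");
}

// The original ASCII-only pattern from the question.
const asciiName = /^[a-zA-Z]+,\s[a-zA-Z]+$/;

// Hypothetical validator: strip the marks first, then apply the ASCII pattern.
function isValidName(input) {
  return asciiName.test(stripDiacritics(input));
}

console.log(stripDiacritics("résumé"));       // "resume"
console.log(isValidName("Müller, Zoë"));      // true
console.log(isValidName("Sørensen, Jørgen")); // false
```

The last call illustrates a limitation: letters such as ø, ß and æ have no canonical decomposition, so NFD leaves them intact and the ASCII pattern still rejects them.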
Reference:
Searching and sorting text with diacritical marks in JavaScript
Unicode Character Class Escape
Upvotes: 10
Reputation: 6959
The easier way to accept all accents is this:
[A-zÀ-ú] // accepts lowercase and uppercase characters
[A-zÀ-ÿ] // as above, but including letters with an umlaut (includes [ ] ^ \ _ ` × ÷)
[A-Za-zÀ-ÿ] // as above but not including [ ] ^ \ _ `
[A-Za-zÀ-ÖØ-öø-ÿ] // as above, but not including [ ] ^ \ _ ` × ÷
See Unicode Character Table for characters listed in numeric order.
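The difference between the loosest and strictest of these ranges can be checked directly. This is a small sketch; the helper name compareRanges and the sample inputs are made up for illustration. The range A-z runs from U+0041 to U+007A, so it also covers the six punctuation characters that sit between Z and a.

```javascript
const loose = /^[A-zÀ-ÿ]+$/;
const strict = /^[A-Za-zÀ-ÖØ-öø-ÿ]+$/;

// Hypothetical helper: tests one input against both character classes.
function compareRanges(input) {
  return { loose: loose.test(input), strict: strict.test(input) };
}

console.log(compareRanges("José"));    // both accept
console.log(compareRanges("foo_bar")); // loose accepts "_" (U+005F), strict rejects
console.log(compareRanges("×÷"));      // loose accepts the math signs, strict rejects
console.log(compareRanges("Šimić"));   // both reject: Š and ć lie beyond U+00FF
```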
Upvotes: 522
Reputation: 17
My context is slightly different and limited to French: I want to search text by allowing a mistake of accents.
For example, I want to find "maîtrisée", but the text to be searched is "... maitrisee ...". So, I used the regular expression /ma[iîï]tris[eéèêë]/ in JavaScript.
In the expression, '[' and ']' define a set of characters, any one of which may match. No '|' is needed inside the set; there it would just match a literal '|' character.
This page gives a list of accented characters: Diacritiques utilisés en français
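Writing such a pattern by hand for every search term gets tedious, so here is a rough sketch of generating it. The FRENCH_VARIANTS table is a hypothetical, deliberately incomplete list, and accentTolerantRegex is an invented name; neither comes from the answer above. The sketch assumes the searched text uses precomposed (NFC) characters.

```javascript
// Hypothetical, incomplete map from base letters to French accented variants.
const FRENCH_VARIANTS = {
  a: "aàâ",
  c: "cç",
  e: "eéèêë",
  i: "iîï",
  o: "oô",
  u: "uùûü",
};

// Build a case-insensitive regex that matches the word with any mix of accents.
function accentTolerantRegex(word) {
  // Reduce the search term to its unaccented base letters first.
  const base = word.normalize("NFD").replace(/\p{Diacritic}/gu, "");
  const source = Array.from(base, (ch) => {
    const variants = FRENCH_VARIANTS[ch.toLowerCase()];
    if (variants) return "[" + variants + "]";
    // Escape regex metacharacters in everything else.
    return ch.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  }).join("");
  return new RegExp(source, "i");
}

const pattern = accentTolerantRegex("maîtrisée");
console.log(pattern.test("... maitrisee ...")); // true
console.log(pattern.test("maîtrisée"));         // true
```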
Upvotes: -1
Reputation: 73888
/^[\pL\pM\p{Zs}.-]+$/u
Explanation:
\pL - matches any kind of letter from any language
\pM - matches a character intended to be combined with another character (e.g. accents, umlauts, enclosing boxes, etc.)
\p{Zs} - matches a whitespace character that is invisible, but does take up space
u - Pattern and subject strings are treated as UTF-8
Unlike other proposed regex (such as [A-Za-zÀ-ÖØ-öø-ÿ]), this will work with all language specific characters, e.g. Šš is matched by this rule, but not matched by others on this page.
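JavaScript did not support these classes natively before ES2018. Engines that implement ES2018 accept the braced forms \p{L}, \p{M} and \p{Zs} when the u flag is set (the short form \pL is not accepted). For older engines, you can use xregexp, e.g.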
const XRegExp = require('xregexp');
const isInputRealHumanName = (input: string): boolean => {
  return XRegExp('^[\\pL\\pM-]+ [\\pL\\pM-]+$', 'u').test(input);
};
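For engines with ES2018 Unicode property escapes, the first pattern in this answer can be written natively, without a library. This is a sketch under that assumption; the function name isNameLike is invented for illustration. Note that native JavaScript requires the braced form \p{L}; the short form \pL is a syntax error when the u flag is set.

```javascript
// Native form of the pattern: letters, combining marks, space separators, "." and "-".
const namePattern = /^[\p{L}\p{M}\p{Zs}.-]+$/u;

// Hypothetical helper wrapping the test.
function isNameLike(input) {
  return namePattern.test(input);
}

console.log(isNameLike("Šimić Željko")); // true
console.log(isNameLike("Jean-Luc"));     // true
console.log(isNameLike("R2D2"));         // false: digits are not letters
```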
Upvotes: 26
Reputation: 341
You can use this:
^([a-zA-Z]|[à-ú]|[À-Ú])+$
It will match any word, whether or not it contains accented characters. Note that the ranges à-ú and À-Ú also include ÷ and ×.
Upvotes: 9
Reputation: 758
From Wikipedia: Basic Latin
For Latin letters, I use
/^[A-Za-zÀ-ÖØ-öø-ÿ]+$/
It avoids hyphens and special characters. (Note A-Za-z rather than A-z: the range A-z would also let through [ \ ] ^ _ and `.)
Upvotes: 4
Reputation: 10427
The XRegExp library has a plugin named Unicode that helps solve tasks like this.
<script src="xregexp.js"></script>
<script src="addons/unicode/unicode-base.js"></script>
<script>
var unicodeWord = XRegExp("^\\p{L}+$");
unicodeWord.test("Русский"); // true
unicodeWord.test("日本語"); // true
unicodeWord.test("العربية"); // true
</script>
Upvotes: 17
Reputation: 2334
The accented Latin range \u00C0-\u017F was not quite enough for my database of names, so I extended the regex to
[a-zA-Z\u00C0-\u024F]
[a-zA-Z\u00C0-\u024F\u1E00-\u1EFF] // includes even more Latin chars
I added these code blocks (\u00C0-\u024F includes three adjacent blocks at once):
\u00C0-\u00FF Latin-1 Supplement
\u0100-\u017F Latin Extended-A
\u0180-\u024F Latin Extended-B
\u1E00-\u1EFF Latin Extended Additional
Note that \u00C0-\u00FF is actually only a part of Latin-1 Supplement. It skips unprintable control signals and all symbols except for the awkwardly-placed multiply × \u00D7 and divide ÷ \u00F7.
[a-zA-Z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u024F] // exclude ×÷
If you need more code points, you can find more ranges on Wikipedia's List of Unicode characters. For example, you could also add Latin Extended-C, D, and E, but I left them out because only historians seem interested in them now, and the D and E sets don't even render correctly in my browser.
The original regex stopping at \u017F borked on the name "Șenol". According to FontSpace's Unicode Analyzer, that first character is \u0218, LATIN CAPITAL LETTER S WITH COMMA BELOW. (Yeah, it's usually spelled with a cedilla-S \u015E, "Şenol." But I'm not flying to Turkey to go tell him, "You're spelling your name wrong!")
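The effect of each extension can be checked with a short sketch. The helper name coverage is invented, and the Vietnamese sample is an added illustration, not a name from the database mentioned above. Names are written with escapes so the exact code points are unambiguous.

```javascript
const upToExtendedA = /^[a-zA-Z\u00C0-\u017F]+$/;
const upToExtendedB = /^[a-zA-Z\u00C0-\u024F]+$/;
const withAdditional = /^[a-zA-Z\u00C0-\u024F\u1E00-\u1EFF]+$/;

// Hypothetical helper: reports which of the three ranges accept the name.
function coverage(name) {
  return {
    extendedA: upToExtendedA.test(name),
    extendedB: upToExtendedB.test(name),
    additional: withAdditional.test(name),
  };
}

console.log(coverage("\u015Eenol"));   // S with cedilla: accepted by all three
console.log(coverage("\u0218enol"));   // S with comma below: needs Latin Extended-B
console.log(coverage("Nguy\u1EC5n"));  // U+1EC5: needs Latin Extended Additional
```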
Upvotes: 75
Reputation: 664970
Which of these three approaches is most suited for the task?
Depends on the task :-) To match exactly all Latin characters and their accented versions, the Unicode ranges probably provide the best solution. They might be extended to all non-whitespace characters, which could be done using the \S character class.
I'm forcing a field in a UI to match the format: last_name, first_name (last [comma space] first)
The most basic problem I'm seeing here is not diacritics, but whitespace. Some names consist of multiple words, e.g. for titles. So you should go with the most generic pattern, allowing everything but the comma that distinguishes first from last name:
/[^,]+,\s[^,]+/
But your second solution with the . character class is just as fine; you might only need to take care of multiple commas then.
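A minimal sketch of the comma-based approach. The ^ and $ anchors and the capture groups are additions made here so the whole field is validated and split in one step; the pattern in the answer is unanchored. The function name splitName and the sample inputs are invented.

```javascript
// Anchored variant of /[^,]+,\s[^,]+/ with capture groups (an assumption of this sketch).
const commaSeparated = /^([^,]+),\s([^,]+)$/;

// Hypothetical helper: returns the two parts, or null if the format does not match.
function splitName(input) {
  const match = commaSeparated.exec(input);
  return match ? { last: match[1], first: match[2] } : null;
}

console.log(splitName("van der Berg, Anna Maria")); // { last: "van der Berg", first: "Anna Maria" }
console.log(splitName("Smith, John, Jr."));         // null: a second comma is rejected
console.log(splitName("Smith,John"));               // null: the space after the comma is required
```

Because [^,] excludes only the comma, multi-word and accented names pass without any Unicode handling, at the cost of also accepting digits and punctuation.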
Upvotes: 20