Mihail Minkov

Reputation: 2633

RegEx to match either words separated by dash or just a single word

The requirement is to match people's last names, with a dash separating each last name.

The base RegEx I am using is:

(?=\S*[-])([a-zA-ZÑñÁáÉéÍíÓóÚúÄäËëÏïÖöÜüÀàÈèÌìÒòÙù'-]+)

Basically, I am limiting it to Latin alphabet characters, including some accented characters.

This works perfectly fine if I use examples like:

But I forgot to account for the case where the person has only one last name.

I tried doing the following.

((?=\S*[-])([\ a-zA-ZÑñÁáÉéÍíÓóÚúÄäËëÏïÖöÜüÀàÈèÌìÒòÙù'-]+))|([A-Za-zÑñÁáÉéÍíÓóÚúÄäËëÏïÖöÜüÀàÈèÌìÒòÙù']+)

I added an escaped space (\ ) to the allowed characters for the first match option, and an alternation (OR) branch for a single word without spaces.

And while it works for some cases there are 2 issues.

  1. I don't think it's the most optimal RegEx for a use case like this.
  2. I stumbled upon the specific case with people who have complex last names.

Regarding point 2, I refer to something like:

The RegEx matches it, but it no longer respects the dash as a separator.

I am not sure how to handle this.

Also since I added the space it no longer respects the requirement for the dash between words.
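To make the symptom concrete, here is a minimal sketch of how the second pattern behaves when used unanchored with String.prototype.match. The helper name firstMatch and the ASCII sample strings are assumptions made for illustration; they are not part of the original code.

```javascript
// The second pattern from the question, copied as-is (no anchors, no flags).
const pattern = /((?=\S*[-])([\ a-zA-ZÑñÁáÉéÍíÓóÚúÄäËëÏïÖöÜüÀàÈèÌìÒòÙù'-]+))|([A-Za-zÑñÁáÉéÍíÓóÚúÄäËëÏïÖöÜüÀàÈèÌìÒòÙù']+)/;

// Hypothetical helper: return the first matched substring, or null.
function firstMatch(input) {
  const m = input.match(pattern);
  return m ? m[0] : null;
}

// The lookahead only needs a dash somewhere in the first run of non-space
// characters. After that, the character class (which now contains a space)
// consumes any number of further words.
console.log(firstMatch("Johnson-De Sosa Extra")); // "Johnson-De Sosa Extra"

// Without a dash in the first word, the second branch matches that word alone.
console.log(firstMatch("Johnson De Sosa")); // "Johnson"
```

In other words, the dash is only checked once near the start, which seems to be why it stops acting as a separator between names.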

What I am thinking is to limit the number of spaces within a last name, for example allowing at most 2 or 3 spaces inside one last name, so that examples like:

Can be valid matches.
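One possible way to sketch that idea in JavaScript is to build the pattern from smaller pieces: a "word" of letters (allowing internal apostrophes), a "last name" made of a word plus a bounded number of space-separated words, and a full value made of last names joined by single dashes. The function name isValidLastNames, the default limit of 3 spaces, and the use of \p{L} instead of an explicit letter list are all assumptions for illustration.

```javascript
// Hypothetical validator: last names are separated by single dashes, and each
// last name may contain at most `maxSpaces` spaces (default 3 is an assumption).
function isValidLastNames(input, maxSpaces = 3) {
  // One word: letters, optionally with internal apostrophes (e.g. O'Connell).
  const word = "\\p{L}+(?:'\\p{L}+)*";
  // One last name: a word followed by up to `maxSpaces` more space-separated words.
  const lastName = `${word}(?: ${word}){0,${maxSpaces}}`;
  // Whole value: one last name, then any number of "-lastName" repetitions.
  const re = new RegExp(`^${lastName}(?:-${lastName})*$`, "u");
  return re.test(input);
}

console.log(isValidLastNames("smith"));                     // true
console.log(isValidLastNames("Johnson-De Sosa"));           // true
console.log(isValidLastNames("Perez De la Cruz-Gonzalez")); // true
console.log(isValidLastNames("Johnson--Johnson"));          // false
console.log(isValidLastNames("One Two Three Four Five"));   // false (4 spaces)
```

Note that under this sketch a value with no dash at all, such as a single multi-word last name, is still accepted; whether that is desirable depends on the actual requirement.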

I am no pro on RegEx so some help would be greatly appreciated.

UPDATE

I did fail to mention I need to be able to use this with JavaScript. PHP could be useful too, but I am doing some browser validation and the patterns need to be compatible.

Upvotes: 3

Views: 1275

Answers (1)

mickmackusa

Reputation: 47764

Logically, you should match one or more letters, then allow a single occurrence of your chosen delimiting characters before allowing another string of one or more letters.

PHP Code: (Demo)

$names = [
    'Pérez-González',
    'Domínguez-Díaz',
    'Güemez-Martínez',
    'Johnson-De Sosa',
    'Pérez-De la Cruz',
    'smith',
    'Pérez De la Cruz-González',
    'de Gal-O\'Connell',
    'Johnson--Johnson'
];

foreach ($names as $name) {
    echo "$name is " . (!preg_match("~^\pL+(?:[- ']\pL+)*$~u", $name) ? 'in' : '') . "valid\n";
}

JavaScript Code:

let names = [
      'Pérez-González',
      'Domínguez-Díaz',
      'Güemez-Martínez',
      'Johnson-De Sosa',
      'Pérez-De la Cruz',
      'smith',
      'Pérez De la Cruz-González',
      'de Gal-O\'Connell',
      'Johnson--Johnson'
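    ];

// Validate each name: a run of letters, optionally followed by more runs,
// each preceded by exactly one delimiter (dash, space, or apostrophe).
for (const name of names) {
    const isValid = /^\p{L}+(?:[- ']\p{L}+)*$/u.test(name);
    document.write("<div>" + name + " is " + (isValid ? '' : 'in') + "valid</div>");
}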

This will only allow a single delimiter between sequences of letters. It will fail if someone's name is "Suzy 'Ng" because that has a space followed by an apostrophe (two consecutive delimiters). I don't know if such a name is possible/real; I just want to clarify the behavior.
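The limitation can be checked directly. The sketch below compares the pattern above with a hypothetical relaxed variant that additionally tolerates an apostrophe right after a dash or space. The names strict and relaxed, and the relaxed pattern itself, are assumptions for illustration rather than a recommendation.

```javascript
// The pattern from this answer: exactly one delimiter between letter runs.
const strict = /^\p{L}+(?:[- ']\p{L}+)*$/u;

// Hypothetical relaxation: a dash or space may be followed by one apostrophe,
// or an apostrophe may appear on its own between letter runs.
const relaxed = /^\p{L}+(?:(?:[- ]'?|')\p{L}+)*$/u;

console.log(strict.test("Suzy 'Ng"));          // false (space then apostrophe)
console.log(relaxed.test("Suzy 'Ng"));         // true
console.log(strict.test("de Gal-O'Connell"));  // true
console.log(relaxed.test("de Gal-O'Connell")); // true
console.log(strict.test("Johnson--Johnson"));  // false
console.log(relaxed.test("Johnson--Johnson")); // false (double dash still rejected)
```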

No lookarounds are necessary.

Upvotes: 1
