FlamingMoe

Reputation: 2994

Extract one or more consecutive words with first capital letter

I wrote a regular expression to extract one or more consecutive words that start with a capital letter. I need it to work with accented letters, but those letters break the expression and produce false matches.

This is the example: http://www.phpliveregex.com/p/eHE (select preg_match_all)

My regular expression:

/([ÁÉÍÓÚÑA-Z]+[a-záéíóúñ]*[\s]{0,1}){1,}/

Test string:

Esto es una prueba para extraer diferentes nombres de personas como Fernández Díaz, Logroño, la Comunidad Valenciana, o también siglas como AVE, y cualquier cosa que empiece por mayúscula y tenga una o varias palabras.

In this case, "úscula" and "én" should not appear in the output.

Upvotes: 1

Views: 196

Answers (2)

trincot

Reputation: 350270

As indicated in the comments, the way to match letters, including all accented variants, is to use the \p escape sequence in combination with the u (Unicode) modifier:

additional escape sequences to match generic character types are available when UTF-8 mode is selected.

\p{xx}
    a character with the xx property

L     Letter (includes the following properties: Ll, Lm, Lo, Lt and Lu)
Ll    Lower case letter
Lm    Modifier letter
Lo    Other letter
Lt    Title case letter
Lu    Upper case letter
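
To make these property classes concrete, here is a minimal sketch. It assumes the file is saved as UTF-8, and the variable names are only illustrative. It also hints at why the u modifier matters: without it, the pattern and subject are treated byte by byte, so an accented letter counts as two units.

```php
<?php
// Assumes this file is saved as UTF-8; variable names are illustrative only.

// Without the u modifier the dot matches single bytes: "é" is two bytes in UTF-8.
$bytes = preg_match_all('/./', 'é');

// With the u modifier the subject is read as UTF-8 characters.
$chars = preg_match_all('/./u', 'é');

// \p{Lu} matches an upper case letter, accented or not.
$upper = preg_match('/^\p{Lu}$/u', 'Ñ');

// \p{Ll} matches lower case letters only, so it rejects "Ñ".
$lower = preg_match('/^\p{Ll}$/u', 'Ñ');

// \p{L} covers letters of either case.
$word = preg_match('/^\p{L}+$/u', 'Fernández');

var_dump($bytes, $chars, $upper, $lower, $word);
```

This should dump int(2), int(1), int(1), int(0) and int(1), in that order.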

You could thus use this regex:

\b(?![\h,])(?:[\h,]*\p{Lu}\pL*)+

This expression ensures that the match does not start with horizontal whitespace (\h) or a comma, and then matches capitalized words separated by those characters. You can remove the comma from both character classes if that is not what you want, or add other punctuation to them.

Note that PHP requires braces when more than one letter follows \p: \pL is fine, but Lu has to be written as \p{Lu}.
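
A short sketch of the brace rule, assuming a UTF-8 source file (the sample words are taken from the question's test string, the variable names are made up for illustration):

```php
<?php
// Assumes a UTF-8 source file; variable names are illustrative only.

// Single-letter property: braces are optional, \pL and \p{L} mean the same.
$short = preg_match('/^\pL+$/u', 'Logroño');
$long  = preg_match('/^\p{L}+$/u', 'Logroño');

// Two-letter property: braces are required.
$braced = preg_match('/^\p{Lu}\p{Ll}+$/u', 'Díaz');

// Without braces, \pLu is parsed as \pL followed by a literal "u",
// so it matches "Ñu" but not "AB".
$literalU = preg_match('/^\pLu$/u', 'Ñu');
$notUpper = preg_match('/^\pLu$/u', 'AB');

var_dump($short, $long, $braced, $literalU, $notUpper);
```

The last two calls show the pitfall: forgetting the braces does not raise an error, it silently changes what the pattern means.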

See PHP Live Regex

Example code (see it on eval.in):

$text = "Esto es una prueba para extraer diferentes nombres de personas " .
        "como Fernández Díaz, Logroño, la Comunidad Valenciana, o también " .
        "siglas como AVE, y cualquier cosa que empiece por mayúscula " .
        "y tenga una o varias palabras.";

preg_match_all('/\b(?![\h,])(?:[\h,]*\p{Lu}\pL*)+/u', $text, $matches); 

var_export($matches);

Output:

array (
  0 => 
  array (
    0 => 'Esto',
    1 => 'Fernández Díaz, Logroño',
    2 => 'Comunidad Valenciana',
    3 => 'AVE',
  ),
)

Without the commas in the regex, 'Fernández Díaz, Logroño' would end up in separate matches:

array (
  0 => 
  array (
    0 => 'Esto',
    1 => 'Fernández Díaz',
    2 => 'Logroño',
    3 => 'Comunidad Valenciana',
    4 => 'AVE',
  ),
)
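
For reference, the comma-free variant mentioned above could be written as follows. This is a sketch that simply drops the comma from both character classes of the regex, again assuming a UTF-8 source file:

```php
<?php
// Assumes a UTF-8 source file. The pattern is the regex from this answer
// with the comma removed from both character classes.
$text = "Esto es una prueba para extraer diferentes nombres de personas " .
        "como Fernández Díaz, Logroño, la Comunidad Valenciana, o también " .
        "siglas como AVE, y cualquier cosa que empiece por mayúscula " .
        "y tenga una o varias palabras.";

// Only horizontal whitespace may now join consecutive capitalized words.
preg_match_all('/\b(?!\h)(?:\h*\p{Lu}\pL*)+/u', $text, $matches);

var_export($matches[0]);
```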

Upvotes: 2

lafor

Reputation: 12776

preg_match_all('/(\b\p{Lu}\p{L}+\s*)+/u', $input, $output);

This assumes that a "word" consists of letters only, and that only words separated by whitespace characters count as consecutive.
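
Two details of this pattern are worth knowing. The \s* inside the group also consumes whitespace after the last word of a run, so a match can end with a space, and \p{L}+ requires at least one letter after the capital, so a word consisting of a single capital letter is skipped. A minimal sketch of trimming the results, assuming a UTF-8 source file ($names is a name introduced here for illustration):

```php
<?php
// Assumes a UTF-8 source file; $names is an illustrative variable name.
$input = "Esto es una prueba para extraer diferentes nombres de personas " .
         "como Fernández Díaz, Logroño, la Comunidad Valenciana, o también " .
         "siglas como AVE, y cualquier cosa que empiece por mayúscula " .
         "y tenga una o varias palabras.";

preg_match_all('/(\b\p{Lu}\p{L}+\s*)+/u', $input, $output);

// $output[0] holds the full matches; the first one is "Esto " with a trailing space.
// Trim each match to drop that trailing whitespace.
$names = array_map('trim', $output[0]);

var_export($names);
```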

Demo: http://www.phpliveregex.com/p/eHG

Upvotes: 2
