Reputation: 2994
I wrote a regular expression to extract one or more consecutive words that start with a capital letter. It needs to handle accented letters, but those letters break the expression and produce false matches.
This is the example: http://www.phpliveregex.com/p/eHE (select preg_match_all)
My regular expression:
/([ÁÉÍÓÚÑA-Z]+[a-záéíóúñ]*[\s]{0,1}){1,}/
Test string:
Esto es una prueba para extraer diferentes nombres de personas como Fernández Díaz, Logroño, la Comunidad Valenciana, o también siglas como AVE, y cualquier cosa que empiece por mayúscula y tenga una o varias palabras.
In this case, "úscula" and "én" (fragments of "mayúscula" and "también") should not appear in the output.
Upvotes: 1
Views: 196
Reputation: 350270
As indicated in the comments, the way to match letters, including all accented variants, is to use the \p escape sequence in combination with the u (unicode) modifier:
Additional escape sequences to match generic character types are available when UTF-8 mode is selected.
\p{xx}: a character with the xx property
L: Letter. Includes the following properties: Ll, Lm, Lo, Lt and Lu.
Ll: Lower case letter
Lm: Modifier letter
Lo: Other letter
Lt: Title case letter
Lu: Upper case letter
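A small sketch of why the u modifier matters here, assuming the script file and the subject string are both UTF-8 encoded. Without u, PCRE treats pattern and subject as raw bytes, so a class such as [ÁÉÍÓÚÑA-Z] also contains the individual bytes of those letters. "ú" shares its first byte with "Á", which is a plausible explanation for the stray "úscula" match:

```php
<?php
// Assumption: this file and the strings below are UTF-8 encoded.
$word = 'mayúscula';

// Byte mode (no u modifier): the class holds the bytes of Á, É, ... so the
// first byte of "ú" (0xC3) is accepted as an "upper-case" character.
$withoutU = preg_match('/[ÁÉÍÓÚÑA-Z]+[a-záéíóúñ]*/', $word, $m1);

// UTF-8 mode: \p{Lu} only matches a complete upper-case code point.
$withU = preg_match('/\p{Lu}\p{Ll}*/u', $word, $m2);

var_dump($withoutU);  // int(1), a false positive
var_dump($m1[0]);     // string "úscula"
var_dump($withU);     // int(0), no upper-case letter in the word

// A capitalized word with accents matches as a whole.
$capitalized = preg_match('/^\p{Lu}\p{Ll}+$/u', 'Ñandú');
var_dump($capitalized); // int(1)
```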
You could thus use this regex:
\b(?![\h,])(?:[\h,]*\p{Lu}\pL*)+
This expression checks that the match does not start with horizontal white space (\h) or a comma, and then matches capitalized words separated by those characters. Remove the comma from both character classes if you do not want it to join words, or add other punctuation to them if you do.
Note that PCRE (and therefore PHP) requires braces when the property name after \p is more than one letter long: \pL is valid, but Lu has to be written as \p{Lu}.
See PHP Live Regex
Example code (see it on eval.in):
$text = "Esto es una prueba para extraer diferentes nombres de personas " .
"como Fernández Díaz, Logroño, la Comunidad Valenciana, o también " .
"siglas como AVE, y cualquier cosa que empiece por mayúscula " .
"y tenga una o varias palabras.";
preg_match_all('/\b(?![\h,])(?:[\h,]*\p{Lu}\pL*)+/u', $text, $matches);
var_export($matches);
Output:
array (
  0 =>
  array (
    0 => 'Esto',
    1 => 'Fernández Díaz, Logroño',
    2 => 'Comunidad Valenciana',
    3 => 'AVE',
  ),
)
Without the commas in the two character classes, 'Fernández Díaz' and 'Logroño' end up in separate matches:
array (
  0 =>
  array (
    0 => 'Esto',
    1 => 'Fernández Díaz',
    2 => 'Logroño',
    3 => 'Comunidad Valenciana',
    4 => 'AVE',
  ),
)
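To switch between the two behaviours without editing the pattern by hand, the joining punctuation could be passed in as a parameter. This is only a sketch: the function name extractCapitalizedRuns and the $joiners parameter are made up for illustration, and the pattern is the one from this answer with the comma replaced by a configurable character list.

```php
<?php
/**
 * Hypothetical helper: returns runs of capitalized words found in $text.
 * $joiners lists punctuation characters allowed to connect words within one
 * match; horizontal white space is always allowed.
 */
function extractCapitalizedRuns(string $text, string $joiners = ','): array
{
    // preg_quote() escapes regex metacharacters and the "/" delimiter, so the
    // characters can be placed inside a character class safely.
    $class = '[\h' . preg_quote($joiners, '/') . ']';
    $pattern = '/\b(?!' . $class . ')(?:' . $class . '*\p{Lu}\pL*)+/u';

    preg_match_all($pattern, $text, $matches);
    return $matches[0];
}

$text = "Esto es una prueba para extraer diferentes nombres de personas " .
    "como Fernández Díaz, Logroño, la Comunidad Valenciana, o también " .
    "siglas como AVE, y cualquier cosa que empiece por mayúscula " .
    "y tenga una o varias palabras.";

$withComma = extractCapitalizedRuns($text);        // comma joins words
$withoutComma = extractCapitalizedRuns($text, ''); // only white space joins words

var_export($withComma);
var_export($withoutComma);
```

With the default argument the result should be the first output listed in this answer, and with an empty $joiners string the second one.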
Upvotes: 2
Reputation: 12776
preg_match_all('/(\b\p{Lu}\p{L}+\s*)+/u', $input, $output);
This assumes that a "word" consists of letters only, and that only words separated by whitespace characters count as consecutive. Because of \p{L}+, each word must also be at least two letters long.
Demo: http://www.phpliveregex.com/p/eHG
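One thing to be aware of with this pattern: the \s* sits inside the repeated group, so a match can carry trailing whitespace when the next word is not capitalized. A minimal sketch of cleaning that up, using a shortened sample sentence that is not from the question:

```php
<?php
// Assumption: UTF-8 source file; $input is an invented sample sentence.
$input = 'como Fernández Díaz, Logroño y AVE, una sigla';

preg_match_all('/(\b\p{Lu}\p{L}+\s*)+/u', $input, $output);

// $output[0] holds the full matches. $output[1] only holds the last
// repetition of the group (for example 'Díaz'), so it is not used here.
$raw = $output[0];

// trim() strips ASCII whitespace; other Unicode spaces would need extra handling.
$names = array_map('trim', $raw);

var_export($raw);   // 'Logroño ' still has its trailing space here
var_export($names);
```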
Upvotes: 2