Reputation: 12193
I'm trying to match keywords in strings by filtering out only words that match the following criteria:
Example:
$string = "Joe O'Donnell and Oscar De La Hoya went to a Pittsburgh Steelers game on Sunday, where Joe lost his iPhone 5, so he borrowed Oscar's iPad";
preg_match_all("/[A-Z][a-z]*/",$string,$match_words); // incorrect expression
// desired result for $match_words should be:
// array(Joe ODonnell, Oscar De La Hoya, Pittsburgh Steelers, Sunday, Joe, iPhone 5, Oscars, iPad)
Thanks
Upvotes: 2
Views: 759
Reputation: 17205
In addition to the answers from Fede, Kelly and Daniel, here are two alternatives for languages with accented characters:
Using preg_split
$capitalized_words = preg_split("/ ([a-zàèìòùáéíóúýâêîôûãñõäëïöüÿçßøåæœ]| )+ /u", $string);
Using preg_match_all
//with 'u' flag
preg_match_all("/\b((?:[A-ZÀÁÂÃÄÅÇÈÉÊËÌÍÎÏÑÒÓÔÕÖØÙÚÛÜÝÆ]['a-zàèìòùáéíóúýâêîôûãñõäëïöüÿçßøåæœ]*\s*\d*)+)\b|\b((?:[a-zàèìòùáéíóúýâêîôûãñõäëïöüÿçßøåæœ]*[A-ZÀÁÂÃÄÅÇÈÉÊËÌÍÎÏÑÒÓÔÕÖØÙÚÛÜÝÆ]['a-zàèìòùáéíóúýâêîôûãñõäëïöüÿçßøåæœ]*\s*\d*)+)\b/u", $string, $capitalized_words);
Function using preg_match_all together with trim
function get_capitalized_words($string) {
    $capitalized_words = array();
    // with 'u' flag
    preg_match_all("/\b((?:[A-ZÀÁÂÃÄÅÇÈÉÊËÌÍÎÏÑÒÓÔÕÖØÙÚÛÜÝÆ]['a-zàèìòùáéíóúýâêîôûãñõäëïöüÿçßøåæœ]*\s*\d*)+)\b|\b((?:[a-zàèìòùáéíóúýâêîôûãñõäëïöüÿçßøåæœ]*[A-ZÀÁÂÃÄÅÇÈÉÊËÌÍÎÏÑÒÓÔÕÖØÙÚÛÜÝÆ]['a-zàèìòùáéíóúýâêîôûãñõäëïöüÿçßøåæœ]*\s*\d*)+)\b/u", $string, $matches);
    if (isset($matches[0])) {
        // Each match may carry a trailing space, so trim them all
        $capitalized_words = array_map('trim', $matches[0]);
    }
    return $capitalized_words;
}
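As a possible variation on the function above, the hand-written letter lists could be swapped for PCRE's Unicode property classes, \p{Lu} (uppercase letter) and \p{Ll} (lowercase letter), which are available with the 'u' flag. This is only a sketch of that substitution, not the method used in this answer; the function name get_capitalized_words_unicode is made up for illustration.

```php
<?php
// Hypothetical variant: same pattern shape as above, but with Unicode
// property classes instead of an explicit list of accented letters.
function get_capitalized_words_unicode(string $string): array
{
    $pattern = '/\b((?:\p{Lu}[\'\p{Ll}]*\s*\d*)+)\b'
             . '|\b((?:\p{Ll}*\p{Lu}[\'\p{Ll}]*\s*\d*)+)\b/u';

    // preg_match_all returns false on error (e.g. invalid UTF-8 input)
    if (preg_match_all($pattern, $string, $matches) === false) {
        return array();
    }

    // Matches can end with a space, so trim each one
    return array_map('trim', $matches[0]);
}

print_r(get_capitalized_words_unicode("Émile Zola écrit à Paris"));
```

The source file and the input string are assumed to be UTF-8 encoded; otherwise the 'u' flag makes the match fail.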
Upvotes: 0
Reputation: 1339
You can make use of PHP's ctype_lower function here!
<?php
$string = "Joe O'Donnell and Oscar De La Hoya went to a Pittsburgh Steelers game on Sunday, where Joe lost his iPhone 5, so he borrowed Oscar's iPad";
$words = $temp = array();
// Loop through the string after turning it into an array (by spaces)
foreach (explode(" ", $string) as $word) {
    // Check if the word is lowercase and is not a number
    if (ctype_lower($word) && !is_numeric($word)) {
        if (empty($temp)) continue; // Don't add it if there's nothing to add
        // Add the words found up until this point (from the last point) into the words array, as a string
        $words[] = implode(" ", $temp);
        // Reset the temp array so we can look for new words and continue
        $temp = array();
        continue;
    }
    // Add this word to the temp array
    $temp[] = $word;
}
// Flush whatever is left over, unless the string ended on a lowercase word
if (!empty($temp)) {
    $words[] = implode(" ", $temp);
}
// Print the words that have uppercase characters
printf("<pre>%s</pre>", print_r($words, true));
Returns:
Array
(
[0] => Joe O'Donnell
[1] => Oscar De La Hoya
[2] => Pittsburgh Steelers
[3] => Sunday,
[4] => Joe
[5] => iPhone 5,
[6] => Oscar's iPad
)
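The output above still carries the commas from the sentence ("Sunday," and "iPhone 5,"). One way to get closer to the result asked for in the question might be to strip trailing punctuation from each word before testing it. The sketch below wraps the same loop in a function; the name group_capitalized_words and the set of punctuation characters are assumptions, not part of the original answer.

```php
<?php
// Hypothetical helper: same ctype_lower approach as above, with trailing
// punctuation removed from each word first.
function group_capitalized_words(string $string): array
{
    $words = $temp = array();
    foreach (explode(" ", $string) as $word) {
        // Assumption: only trailing , . ; : ! ? need to be removed
        $word = rtrim($word, ",.;:!?");
        if ($word === "") continue; // skip empty pieces (e.g. double spaces)

        if (ctype_lower($word) && !is_numeric($word)) {
            // A lowercase word closes the current group, if there is one
            if (!empty($temp)) {
                $words[] = implode(" ", $temp);
                $temp = array();
            }
            continue;
        }
        $temp[] = $word;
    }
    // Flush the last group
    if (!empty($temp)) {
        $words[] = implode(" ", $temp);
    }
    return $words;
}

$string = "Joe O'Donnell and Oscar De La Hoya went to a Pittsburgh Steelers game on Sunday, where Joe lost his iPhone 5, so he borrowed Oscar's iPad";
print_r(group_capitalized_words($string));
```

With the example string this should give "Sunday" and "iPhone 5" without the commas. Note that "Oscar's iPad" still comes back as a single element, because no lowercase word separates the two.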
Upvotes: 2
Reputation: 365
Adding to Fede's sweet answer, this would be your new PHP code:
$string = "Joe O'Donnell and Oscar De La Hoya went to a Pittsburgh Steelers game on Sunday, where Joe lost his iPhone 5, so he borrowed Oscar's iPad";
preg_match_all("/\b((?:[A-Z]['a-z]*\s*\d*)+)\b|\b((?:[a-z]*[A-Z]['a-z]*\s*\d*)+)\b/", $string, $matches);
print_r($matches[0]);
$matches[0] would be your array of matches.
Upvotes: 2
Reputation: 121
You could first remove all non-alphanumeric characters:
$string2 = preg_replace("/[^a-zA-Z0-9\s]/", "", $string);
Then use preg_split instead of preg_replace to split the string by sequences of entirely lower-case words.
$match_words = preg_split("/ ([a-z]| )+ /", $string2);
(If you don't mind $string being destroyed, you can replace $string2 with $string.)
This works for the example you provided, but consider how you want your program to behave with less sanitised input. The split pattern needs at least three characters to match, so "Foo   Bar" (three spaces) would be split into two elements, whereas "Foo Bar" (one space) would remain as one. If you're not worried about speed, you could use another preg_replace to replace any sequence of whitespace with a single space.
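Putting the steps of this answer together, including the optional whitespace clean-up, might look like the sketch below. The function name split_on_lowercase_runs is invented for illustration, and the extra trim call is an assumption to avoid leading or trailing spaces in the input.

```php
<?php
// Hypothetical helper combining the steps described in this answer.
function split_on_lowercase_runs(string $string): array
{
    // Step 1: remove everything except letters, digits and whitespace
    $clean = preg_replace('/[^a-zA-Z0-9\s]/', '', $string);

    // Optional step: collapse any run of whitespace into a single space
    $clean = trim(preg_replace('/\s+/', ' ', $clean));

    // Step 2: split on sequences of entirely lower-case words
    return preg_split('/ ([a-z]| )+ /', $clean);
}

print_r(split_on_lowercase_runs("Foo   Bar"));
```

With the whitespace collapsed first, "Foo   Bar" (three spaces) stays together as the single element "Foo Bar" instead of being split on the run of spaces.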
Upvotes: 3
Reputation: 31005
You could use a regex like this:
\b((?:[A-Z]['a-z]*\s*\d*)+)\b|\b((?:[a-z]*[A-Z]['a-z]*\s*\d*)+)\b
Match information:
MATCH 1
1. [0-14] `Joe O'Donnell `
MATCH 2
1. [18-35] `Oscar De La Hoya `
MATCH 3
1. [45-65] `Pittsburgh Steelers `
MATCH 4
1. [73-79] `Sunday`
MATCH 5
1. [87-91] `Joe `
MATCH 6
2. [100-108] `iPhone 5`
MATCH 7
1. [125-133] `Oscar's `
MATCH 8
2. [133-137] `iPad`
The regex consists of two patterns:
\b((?:[A-Z]['a-z]*\s*\d*)+)\b ---> Match words like Joe O'Donnell or Oscar De La Hoya
|
\b((?:[a-z]*[A-Z]['a-z]*\s*\d*)+)\b ---> Match words like iPad or iPhone
Btw, if you take a look at the results, some of the matches have a trailing space at the end. You can trim each result to clean it up.
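A minimal sketch of that clean-up in PHP, assuming the full matches are read from index 0 of the array filled by preg_match_all (the function name match_capitalized is made up for illustration):

```php
<?php
// Hypothetical helper: runs the regex from this answer and trims every match.
function match_capitalized(string $string): array
{
    $pattern = "/\b((?:[A-Z]['a-z]*\s*\d*)+)\b|\b((?:[a-z]*[A-Z]['a-z]*\s*\d*)+)\b/";
    preg_match_all($pattern, $string, $matches);

    // $matches[0] holds the full matches; remove the trailing spaces
    return array_map('trim', $matches[0]);
}

$string = "Joe O'Donnell and Oscar De La Hoya went to a Pittsburgh Steelers game on Sunday, where Joe lost his iPhone 5, so he borrowed Oscar's iPad";
print_r(match_capitalized($string));
```

The trimmed values correspond to the eight matches listed above, for example "Joe O'Donnell" instead of "Joe O'Donnell ".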
Upvotes: 3