John

Reputation: 12193

preg_match to find words with capitals and successive capitalized words

I'm trying to extract keywords from strings by keeping only words that contain a capital letter, with successive capitalized words grouped into a single match.

Example:

$string = "Joe O'Donnell and Oscar De La Hoya went to a Pittsburgh Steelers game on Sunday, where Joe lost his iPhone 5, so he borrowed Oscar's iPad";

preg_match_all("/[A-Z][a-z]*/",$string,$match_words); // incorrect expression

// desired result for $match_words should be: 
// array(Joe ODonnell, Oscar De La Hoya, Pittsburgh Steelers, Sunday, Joe, iPhone 5, Oscars, iPad)

Thanks

Upvotes: 2

Views: 759

Answers (5)

RafaSashi

Reputation: 17205

In addition to the answers from Fede, Kelly and Daniel, here are two alternatives that also handle accented characters:

Using preg_split

$capitalized_words = preg_split("/ ([a-zàèìòùáéíóúýâêîôûãñõäëïöüÿçßøåæœ]| )+ /u", $string);

Using preg_match_all

//with 'u' flag 
preg_match_all("/\b((?:[A-ZÀÁÂÃÄÅÇÈÉÊËÌÍÎÏÑÒÓÔÕÖØÙÚÛÜÝÆ]['a-zàèìòùáéíóúýâêîôûãñõäëïöüÿçßøåæœ]*\s*\d*)+)\b|\b((?:[a-zàèìòùáéíóúýâêîôûãñõäëïöüÿçßøåæœ]*[A-ZÀÁÂÃÄÅÇÈÉÊËÌÍÎÏÑÒÓÔÕÖØÙÚÛÜÝÆ]['a-zàèìòùáéíóúýâêîôûãñõäëïöüÿçßøåæœ]*\s*\d*)+)\b/u", $string, $capitalized_words);

Function using preg_match_all together with trim

function get_capitalized_words($string){
    $capitalized_words=array();

    //with 'u' flag 
    preg_match_all("/\b((?:[A-ZÀÁÂÃÄÅÇÈÉÊËÌÍÎÏÑÒÓÔÕÖØÙÚÛÜÝÆ]['a-zàèìòùáéíóúýâêîôûãñõäëïöüÿçßøåæœ]*\s*\d*)+)\b|\b((?:[a-zàèìòùáéíóúýâêîôûãñõäëïöüÿçßøåæœ]*[A-ZÀÁÂÃÄÅÇÈÉÊËÌÍÎÏÑÒÓÔÕÖØÙÚÛÜÝÆ]['a-zàèìòùáéíóúýâêîôûãñõäëïöüÿçßøåæœ]*\s*\d*)+)\b/u", $string, $matches);

    if(isset($matches[0])){
        $capitalized_words=array_map('trim',$matches[0]);
    }

    return $capitalized_words;
}
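
An alternative to listing accented letters by hand is to rely on Unicode property classes. The sketch below is not taken from any of the answers: it swaps the explicit letter lists for \p{Lu} (uppercase letter) and \p{Ll} (lowercase letter), and swaps \b for lookarounds so a match does not depend on how \b treats non-ASCII letters. The function name get_capitalized_words_unicode is hypothetical, and the input is assumed to be UTF-8 with precomposed (NFC) characters.

```php
<?php

// Hypothetical helper: same idea as get_capitalized_words(), but using
// Unicode property classes instead of hand-written accented letter lists.
function get_capitalized_words_unicode(string $string): array
{
    // (?<![\p{L}\p{N}]) and (?![\p{L}\p{N}]) stand in for \b:
    // the match may not start or end in the middle of a word or number.
    $pattern = "/(?<![\p{L}\p{N}])(?:(?:\p{Lu}[\p{Ll}']*\s*\d*)+|(?:\p{Ll}*\p{Lu}[\p{Ll}']*\s*\d*)+)(?![\p{L}\p{N}])/u";

    // preg_match_all returns 0 on no match and false on error
    if (!preg_match_all($pattern, $string, $matches)) {
        return array();
    }

    // Trim in case a match still ends with whitespace (e.g. before punctuation)
    return array_map('trim', $matches[0]);
}

$result = get_capitalized_words_unicode("Émile Zola went to São Paulo with an iPhone 5");
print_r($result);
```

This should print Émile Zola, São Paulo and iPhone 5 as three separate entries. The trade-off is that \p{Lu} and \p{Ll} cover every script with case, not only the Western European letters listed above.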

Upvotes: 0

helllomatt

Reputation: 1339

You can make use of PHP's ctype_lower function here!

<?php

$string = "Joe O'Donnell and Oscar De La Hoya went to a Pittsburgh Steelers game on Sunday, where Joe lost his iPhone 5, so he borrowed Oscar's iPad";

$words = $temp = array();

// Loop through the string after turning it into an array (by spaces)
foreach (explode(" ", $string) as $word) {
    // Check if the word is lowercase and is not a number
    if (ctype_lower($word) && !is_numeric($word)) {
        if (empty($temp)) continue; // Don't add it if there's nothing to add

        // Add the words found up until this point (from the last point) into the words array, as a string
        $words[] = implode(" ", $temp);

        // Reset the temp array so we can look for new words and continue
        $temp = array();
        continue;
    }

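    // Add this word to the temporary group of capitalized words
    $temp[] = $word;
}

// Flush the last group, if any, so a trailing lowercase word does not add an empty entry
if (!empty($temp)) $words[] = implode(" ", $temp);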

// Print the words that have uppercase characters
printf("<pre>%s</pre>", print_r($words, true));

Returns:

Array
(
    [0] => Joe O'Donnell
    [1] => Oscar De La Hoya
    [2] => Pittsburgh Steelers
    [3] => Sunday,
    [4] => Joe
    [5] => iPhone 5,
    [6] => Oscar's iPad
)
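
Note that the output above keeps trailing commas ("Sunday,", "iPhone 5,"). A possible variation, sketched below, strips surrounding punctuation from each token before the ctype_lower check and treats trailing punctuation as the end of a group. The function name group_capitalized and the punctuation list are assumptions made for illustration, not part of the original answer.

```php
<?php

// Hypothetical variation of the loop above: punctuation is stripped from
// each token, and a token that ended with punctuation closes the current group.
function group_capitalized(string $string): array
{
    $punct = ",.;:!?\"()";   // assumed set of punctuation to strip
    $groups = array();
    $current = array();

    foreach (preg_split('/\s+/', trim($string)) as $token) {
        $word = trim($token, $punct);

        // Lowercase (or empty) tokens end the current group
        if ($word === '' || ctype_lower($word)) {
            if (!empty($current)) {
                $groups[] = implode(" ", $current);
                $current = array();
            }
            continue;
        }

        $current[] = $word;

        // Trailing punctuation such as "Sunday," also ends the group
        if (rtrim($token, $punct) !== $token) {
            $groups[] = implode(" ", $current);
            $current = array();
        }
    }

    if (!empty($current)) {
        $groups[] = implode(" ", $current);
    }

    return $groups;
}

$string = "Joe O'Donnell and Oscar De La Hoya went to a Pittsburgh Steelers game on Sunday, where Joe lost his iPhone 5, so he borrowed Oscar's iPad";
$groups = group_capitalized($string);
print_r($groups);
```

With the example string this gives Joe O'Donnell, Oscar De La Hoya, Pittsburgh Steelers, Sunday, Joe, iPhone 5 and Oscar's iPad. Lowercase words containing an apostrophe (such as "don't") still fail ctype_lower and would be kept, so this remains a rough approach.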

Upvotes: 2

Kelly Kiernan

Reputation: 365

Adding to Fede's sweet answer, this would be your new PHP code:

$string = "Joe O'Donnell and Oscar De La Hoya went to a Pittsburgh Steelers game on Sunday, where Joe lost his iPhone 5, so he borrowed Oscar's iPad";

preg_match_all("/\b((?:[A-Z]['a-z]*\s*\d*)+)\b|\b((?:[a-z]*[A-Z]['a-z]*\s*\d*)+)\b/", $string, $matches);

print_r($matches[0]);

$matches[0] would be your array of matches.

Upvotes: 2

Daniel

Reputation: 121

You could first remove all non-alphanumeric characters:

$string2 = preg_replace("/[^a-zA-Z0-9\s]/", "", $string);

Then use preg_split instead of preg_match_all to split the string on sequences of entirely lower-case words:

$match_words = preg_split("/ ([a-z]| )+ /", $string2);

(If you don't mind $string being overwritten, you can replace $string2 with $string.)

This works for the example you provided, but consider how you want your program to behave with less sanitised input. For instance, "Foo   Bar" (three spaces) would be split into two elements, because the pattern matches any run of at least three spaces or lower-case letters that starts and ends with a space, whereas "Foo Bar" (one space) would remain as one. If you're not worried about speed, you could use another preg_replace to replace any sequence of whitespace with a single space.
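
The whitespace clean-up mentioned above could look roughly like this. The sample input is made up for illustration, and collapsing with /\s+/ plus trim is one possible way to do it, not necessarily what the answer had in mind.

```php
<?php

// Made-up input with irregular spacing
$string = "Joe   O'Donnell went  to  a  Pittsburgh Steelers";

// Step 1 (from the answer): drop everything except letters, digits and whitespace
$string2 = preg_replace("/[^a-zA-Z0-9\s]/", "", $string);

// Step 2 (the extra clean-up): collapse each whitespace run to one space, trim the ends
$string2 = trim(preg_replace("/\s+/", " ", $string2));

// Step 3 (from the answer): split on sequences of entirely lower-case words
$match_words = preg_split("/ ([a-z]| )+ /", $string2);

print_r($match_words);
```

After normalisation the string is "Joe ODonnell went to a Pittsburgh Steelers", so the split yields Joe ODonnell and Pittsburgh Steelers.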

Upvotes: 3

Federico Piazza

Reputation: 31005

You could use a regex like this:

\b((?:[A-Z]['a-z]*\s*\d*)+)\b|\b((?:[a-z]*[A-Z]['a-z]*\s*\d*)+)\b

Match information:

MATCH 1
1.  [0-14]  `Joe O'Donnell `
MATCH 2
1.  [18-35] `Oscar De La Hoya `
MATCH 3
1.  [45-65] `Pittsburgh Steelers `
MATCH 4
1.  [73-79] `Sunday`
MATCH 5
1.  [87-91] `Joe `
MATCH 6
2.  [100-108]   `iPhone 5`
MATCH 7
1.  [125-133]   `Oscar's `
MATCH 8
2.  [133-137]   `iPad`

The regex consists of two patterns:

\b((?:[A-Z]['a-z]*\s*\d*)+)\b       ---> Match words like Joe O'Donnell or Oscar De La Hoya
|
\b((?:[a-z]*[A-Z]['a-z]*\s*\d*)+)\b ---> Match words like iPad or iPhone

By the way, if you look at the results, some matches end with a trailing space; you can run trim on each result to clean it up.
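
As a rough sketch of that clean-up step, the following applies trim to every full match. The variable names $clean and $noApostrophes are only illustrative, and the last step (dropping apostrophes to get ODonnell and Oscars, as in the question's desired output) is an assumption about what the asker wants.

```php
<?php

$string = "Joe O'Donnell and Oscar De La Hoya went to a Pittsburgh Steelers game on Sunday, where Joe lost his iPhone 5, so he borrowed Oscar's iPad";

// Same pattern as above
$pattern = "/\b((?:[A-Z]['a-z]*\s*\d*)+)\b|\b((?:[a-z]*[A-Z]['a-z]*\s*\d*)+)\b/";
preg_match_all($pattern, $string, $matches);

// $matches[0] holds the full matches; trim removes the trailing spaces
$clean = array_map('trim', $matches[0]);

// Optional (assumed requirement): drop apostrophes, as in the desired output
$noApostrophes = str_replace("'", "", $clean);

print_r($clean);
print_r($noApostrophes);
```

The first array contains Joe O'Donnell, Oscar De La Hoya, Pittsburgh Steelers, Sunday, Joe, iPhone 5, Oscar's and iPad; the second is the same list with ODonnell and Oscars.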

Upvotes: 3
