user1307016

Reputation: 405

PHP preg-match-all capture

I would like to capture each of the following in its own group with preg_match_all in PHP:

  1. The keyword for the chapter, section, or page
  2. The number of the specified chapter, section, or page, plus its letter if it has one. A single space between the keyword, the number, and the letter should be allowed
  3. The word "and" or "or" that follows, if there is one

All book titles should be ignored, and the number of items in the string may vary. The regex should work on all of the examples below:

  1. Ch1 and Sect2b
  2. Ch 4 x unwantedtitle and Sect 5y unwanted title and Sect6 z and Ch7 or Ch8

This is what I managed to come up with so far:

    $str = 'Ch 1 a unwantedtitle and Sect 2b unwanted title and Pg3';
    preg_match_all ('/([a-z]+)(?=\d|\d\s)\s*(\d*)\s*(?<=\d|\d\s)([a-z]?).*?(and|or)?/i', $str, $matches);

    Array
    (
        [0] => Array
            (
                [0] => Pg3
            )

        [1] => Array
            (
                [0] => Pg
            )

        [2] => Array
            (
                [0] => 3
            )

        [3] => Array
            (
                [0] => 
            )

        [4] => Array
            (
                [0] => 
            )

    )

The expected result should be:

    Array
    (
        [0] => Array
            (
                [0] => Ch 1 a and 
                [1] => Sect 2b and 
                [2] => Pg3
            )

        [1] => Array
            (
                [0] => Ch
                [1] => Sect
                [2] => Pg
            )

        [2] => Array
            (
                [0] => 1
                [1] => 2
                [2] => 3
            )

        [3] => Array
            (
                [0] => a
                [1] => b
                [2] => 
            )

        [4] => Array
            (
                [0] => and
                [1] => and
                [2] => 
            )

    )

Upvotes: 0

Views: 292

Answers (2)

inhan

Reputation: 7470

This is how I would do it.

    $arr = array(
        'Ch1 and Sect2b',
        'Ch 1 a unwantedtitle and Sect 2b unwanted title and Pg3',
        'Ch 4 x unwantedtitle and Sect 5y unwanted title and' .
            ' Sect6 z and Ch7 or Ch8a',
        'Assume this is ch1a and ch 2 or ch seCt 5c.' .
            ' Then SECT or chA pg22a and pg 13 andor'
    );

    foreach ($arr as $a) {
        var_dump($a);
        preg_match_all(
            '~
                \b(?P<word>ch|sect|(pg))    # keyword; group 2 is set only for "pg"
                \s*(?P<number>\d+)          # number, optionally preceded by spaces
                (?(2)\b|                    # after "pg" the number must end the word
                    \s*
                    (?P<letter>(?!(?<=\s)(?:and|or)\b)[a-z]+)?    # letters, unless a standalone and/or
                    \s*
                    (?:(?<=\s)(?P<cond>and|or)\b)?                # trailing standalone and/or
                )
            ~xi',
            $a,
            $m
        );
        foreach ($m as $k => $v) {
            // "Beautify" the result array: drop the numeric copies of the groups.
            // Note that $m[0] is kept and still returns the whole matches.
            if (is_numeric($k) && $k !== 0) unset($m[$k]);
        }
        print_r($m);
    }

I had to make pg a capturing group of its own because it needs an explicit condition: pg may be followed by a number (with or without spaces in between), but not by any letters, since a page indicator will not carry a letter as in "pg23a".
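To see the conditional on its own, here is a minimal sketch with a cut-down pattern. The pattern, the sample string, and the variable names are made up for the demonstration and are not the full pattern above. The `(?(2)yes|no)` construct applies the `yes` branch only when group 2, i.e. `pg`, took part in the match:

```php
<?php
// Cut-down, hypothetical pattern: group 2 is set only when "pg" matched.
$pattern = '~\b(ch|(pg))\s*(\d+)(?(2)\b|[a-z]?)~i';
$sample  = 'pg22a pg 13 ch1a';

preg_match_all($pattern, $sample, $demo);

// "pg22a" is rejected: after "pg" the number must end at a word boundary.
// "pg 13" and "ch1a" are accepted; only "ch" may take a trailing letter.
print_r($demo[0]);
```

Without the conditional, the letter part would have to be written once for every keyword that allows it.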

That extra group is also why I named each group and "beautified" the result with the inner foreach loop. If you use numeric indexes instead of named ones, you will need to skip $m[2].

As an example, here is the output for the last item in $arr:
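As a possible alternative to the inner foreach loop, the numeric keys can also be dropped with `array_filter` and `ARRAY_FILTER_USE_KEY` (available since PHP 5.6). This is only a sketch: the pattern is a simplified stand-in and `$named` is a name chosen for illustration.

```php
<?php
// Simplified, hypothetical pattern with two named groups.
$pattern = '~\b(?P<word>ch|sect|pg)\s*(?P<number>\d+)~i';
preg_match_all($pattern, 'Ch1 and Sect2b', $m);

// Keep the whole-match entry (key 0) and the named (string) keys;
// drop the numeric duplicates that are added for every named group.
$named = array_filter(
    $m,
    function ($key) {
        return $key === 0 || is_string($key);
    },
    ARRAY_FILTER_USE_KEY
);

print_r(array_keys($named));
```

The filter leaves `$m` untouched and returns a new array, which can be handy if the numeric groups are still needed elsewhere.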

    Array
    (
        [0] => Array
            (
                [0] => ch1a and
                [1] => ch 2 or
                [2] => seCt 5c
                [3] => pg 13
            )

        [word] => Array
            (
                [0] => ch
                [1] => ch
                [2] => seCt
                [3] => pg
            )

        [number] => Array
            (
                [0] => 1
                [1] => 2
                [2] => 5
                [3] => 13
            )

        [letter] => Array
            (
                [0] => a
                [1] => 
                [2] => c
                [3] => 
            )

        [cond] => Array
            (
                [0] => and
                [1] => or
                [2] => 
                [3] => 
            )

    )

Upvotes: 0

Westy92

Reputation: 21325

This is the closest I could get:

    $str = 'Ch 1 a unwantedtitle and Sect 2b unwanted title and Pg3';
    preg_match_all ('/((Ch|Sect|Pg)\s?(\d+)\s?(\w?))(.*?(and|or))?/i', $str, $matches);

    Array
    (
        [0] => Array
            (
                [0] => Ch 1 a unwantedtitle and
                [1] => Sect 2b unwanted title and
                [2] => Pg3
            )

        [1] => Array
            (
                [0] => Ch 1 a
                [1] => Sect 2b
                [2] => Pg3
            )

        [2] => Array
            (
                [0] => Ch
                [1] => Sect
                [2] => Pg
            )

        [3] => Array
            (
                [0] => 1
                [1] => 2
                [2] => 3
            )

        [4] => Array
            (
                [0] => a
                [1] => b
                [2] => 
            )

        [5] => Array
            (
                [0] =>  unwantedtitle and
                [1] =>  unwanted title and
                [2] => 
            )

        [6] => Array
            (
                [0] => and
                [1] => and
                [2] => 
            )

    )
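One possible way to tighten this, offered as a sketch rather than a general solution: use word boundaries so that the letter group cannot pick up the first letter of a title, match "and"/"or" only as whole words, and use a non-capturing group for the skipped title so that only four groups remain. The pattern below is an assumption checked against the sample strings from the question only. A one-letter title word directly after the number would still be read as the letter.

```php
<?php
$str = 'Ch 1 a unwantedtitle and Sect 2b unwanted title and Pg3';

// Sketch of a tightened pattern (an assumption, not a general solution):
//   \b(Ch|Sect|Pg)          keyword as a whole word
//   \s?(\d+)\s?([a-z]?)\b   number, then at most one standalone letter
//   (?:.*?\b(and|or)\b)?    skip the title lazily up to a whole-word and/or
$pattern = '/\b(Ch|Sect|Pg)\s?(\d+)\s?([a-z]?)\b(?:.*?\b(and|or)\b)?/i';

preg_match_all($pattern, $str, $matches);

// $matches[1] holds the keywords, [2] the numbers,
// [3] the letters and [4] the and/or words.
print_r($matches);
```

$matches[0] still contains the skipped titles, because the whole match has to run through them to reach the next "and" or "or".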

Upvotes: 0
