turbonerd

Reputation: 1306

Split a string by comma, quote and full stop, with a few exceptions

I've got a lot of text, similar to the following paragraphs, which I'd like to split into words without punctuation (', ", comma, full stop, newline, etc.), with a few exceptions.

Initially considered endemic to the Chalakudy River system in Kerala state, southern India, but now recognised to have a wider distribution in surrounding drainages including the Periyar, Manimala, and Pamba river though the Manimala data may be questionable given it seems to be the type locality of P. denisonii.

In the Achankovil River basin it occurs sympatrically, and sometimes syntopically, with P. denisonii.

Wild stocks may have dwindled by as much as 50% in the last 15 years or so with collection for the aquarium trade largely held responsible although habitats are also being degraded by pollution from agricultural and domestic sources, plus destructive fishing methods involving explosives or organic toxins.

The text refers to P. denisonii which is a species of fish. It's an abbreviation of Genus species. I would like this reference to be one word.

So, for instance, this is the kind of array I'd like to see:

Array
(
    ...
    [44] given
    [45] it
    [46] seems
    [47] to
    [48] be
    [49] the
    [50] type
    [51] locality
    [52] of
    [53] P. denisonii
    [54] In
    [55] the
    ...
)

The only things that distinguish these species references such as P. denisonii from a new sentence like end. New are:

What regexp can I use with preg_split to give me such an array? I've tried a simple explode( " ", $array ) but it doesn't do the job at all.

Thanks in advance,

Upvotes: 1

Views: 1368

Answers (1)

Cranio

Reputation: 9847

Change your approach: why not use preg_match_all instead of preg_split? Instead of splitting the text on delimiters, you match every string that does not contain them.

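Use it with a regexp like /(P\. denisonii)|(\S+)/ to match the sequence "P. denisonii" and every other non-whitespace sequence. The species name has to come first in the alternation: the regex engine tries the branches left to right, so with \S+ first it would consume "P." on its own before the species branch is ever tried. The dot is escaped so that it matches a literal full stop rather than any character.

To exclude comma, quote, full stop and other characters, replace \S with a negated character class [^...]:

/(P\. denisonii)|([^\s,.\"]+)/ matches "P. denisonii" as one token, plus every sequence that contains no whitespace (\s), comma, full stop or double quote. Inside a character class the dot is already literal, so it needs no backslash.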
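As a minimal sketch of the pattern above (the sample sentence is a shortened, hypothetical excerpt of the question's text, and the single quote is added to the excluded characters as an assumption based on the question's list of punctuation):

```php
<?php
// Hypothetical sample, shortened from the text in the question.
$text = "the type locality of P. denisonii. In the Achankovil River basin";

// The literal species name is listed first so that it is matched whole;
// the second branch picks up every other run of non-delimiter characters.
preg_match_all('/P\. denisonii|[^\s,."\']+/', $text, $matches);

// $matches[0] holds the full matches in the order they occur.
$words = $matches[0];
print_r($words);
```

Here "P. denisonii" should come back as a single element, and the full stop that ends the sentence is dropped because neither branch can match it.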

Edit: to match any abbreviated genus name rather than one fixed species (NOTE: I've altered your text to test the code better by adding quotes and a bogus genus name):

$text = "Initially considered \"endemic\" to the Chalakudy River system in Kerala state, southern India, but now recognised to have a wider distribution in surrounding drainages including the Periyar, Manimala, and Pamba river though the Manimala data may be questionable given it seems to be the type locality of P. denisonii.

This is a bogus genus name, A. testii.

In the Achankovil River basin it occurs sympatrically, and sometimes syntopically, with P. denisonii.

Wild stocks may have dwindled by as much as 50% in the last 15 years or so with collection for the aquarium trade largely held responsible although habitats are also being degraded by pollution from agricultural and domestic sources, plus destructive fishing methods involving explosives or organic toxins.";


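// Branch 1: a capital letter, a full stop, a space and a lowercase word, e.g. "P. denisonii".
// Branch 2: any run of characters other than whitespace, comma, full stop and quotes.
preg_match_all("/([A-Z]\. [a-z]+)|([^\s,.\"']+)/", $text, $matches, PREG_PATTERN_ORDER);

echo "<pre>";
print_r($matches);
echo "</pre>";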

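NOTE: the array you should pick is $matches[0], not $matches. With PREG_PATTERN_ORDER, $matches[0] holds every full match in order, while $matches[1] and $matches[2] hold what each capture group matched (with empty strings where the other branch matched).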
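The same idea can be wrapped in a small function that returns only the flat word list. This is a sketch, not a definitive implementation: the name split_words is hypothetical, the groups are left uncaptured because only the full matches are needed, and the single quote is excluded as an assumption based on the question.

```php
<?php
// Hypothetical helper; the name and signature are not from the original answer.
function split_words(string $text): array
{
    // Branch 1: abbreviated binomial such as "P. denisonii"
    //           (one capital letter, a full stop, a space, a lowercase word).
    // Branch 2: any run of characters that is not whitespace, comma,
    //           full stop, double quote or single quote.
    $pattern = '/[A-Z]\. [a-z]+|[^\s,."\']+/';
    preg_match_all($pattern, $text, $matches);

    // Index 0 contains the full matches in document order.
    return $matches[0];
}

$sample = "with P. denisonii.\n\nWild stocks, \"endemic\" here.";
$result = split_words($sample);
print_r($result);
```

One limitation to keep in mind: any sentence that ends in a single capital letter and is followed by a lowercase word would also be joined by the first branch, so the pattern relies on the text following the "Genus-initial, full stop, species" convention.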

Upvotes: 2
