mandel

Reputation: 181

php regex - extract all text before certain characters

I am trying to extract publisher information from a string. It comes in various formats such as:

John Wiley & Sons (1995), Paperback, 154 pages

New York, Crowell [1963] viii, 373 p. illus. 20 cm.

New York: Bantam Books, c1990. xx, 444 p. : ill. ; 27 cm.

Garden City, N.Y., Doubleday, 1963. 142 p. illus. 22 cm. [1st ed.]

All I want to extract is the publisher name, so everything from the ( or the [ onward can be ignored, but I need to keep every character before it. It's complicated by the fact that in example three I'd want the text before the comma, whereas in example two I'd want the text before the square bracket only, keeping that comma if possible.

I'm willing to work with a regex that takes everything before the first (, [ or comma, and to accept imperfect data (like only getting "New York" for example 2), since I wouldn't want to insert all of example 3 into the database. Most of the data has the date in brackets, as in examples 1 and 2.

Thanks in advance for any suggestions!

Upvotes: 0

Views: 2203

Answers (2)

Tomalak

Reputation: 338148

Hm, how about replacing:

[^\w\n\r]+c?[12]\d{3}.*

with the empty string? Explanation:

[^\w\n\r]+   # one or more non-word characters (but no newlines either!)
c?           # an optional "c"
[12]\d{3}    # a year (probably, at least)
.*           # all the rest of the line

Works for your example, might need a little extra tweaking.
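
A minimal PHP sketch of this replacement, assuming the records arrive as one multi-line string. The variable names `$books`, `$cleaned` and `$publishers` are hypothetical, and the sample input is simply the four lines from the question:

```php
<?php
// Hypothetical sample input: the four records from the question, one per line.
$books = <<<TXT
John Wiley & Sons (1995), Paperback, 154 pages
New York, Crowell [1963] viii, 373 p. illus. 20 cm.
New York: Bantam Books, c1990. xx, 444 p. : ill. ; 27 cm.
Garden City, N.Y., Doubleday, 1963. 142 p. illus. 22 cm. [1st ed.]
TXT;

// Remove everything from the separator in front of the year to the end of the line.
// "." does not match a newline by default, so each line is trimmed independently.
$cleaned = preg_replace('/[^\w\n\r]+c?[12]\d{3}.*/', '', $books);

// Split the result back into one publisher string per record (\R matches any line break).
$publishers = preg_split('/\R/', $cleaned);

print_r($publishers);
```

Lines that contain no year-like token are left untouched by this approach, so they would need a separate check before going into the database.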

Upvotes: 2

Aillyn

Reputation: 23783

Here is one: #(.+?)\W*.\d{4}#:

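// $books is assumed to hold all records as one string, one record per line
preg_match_all('#(.+?)\W*.\d{4}#', $books, $matches);
// $matches[1] contains the text captured by the first group of each match
$publishers = array_map('trim', $matches[1]);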

print_r($publishers);

Generates (as seen on ideone):

Array
(
    [0] => John Wiley & Sons
    [1] => New York, Crowell
    [2] => New York: Bantam Books
    [3] => Garden City, N.Y., Doubleday
)

It basically extracts everything before the sequence [any number of non-word characters + 1 arbitrary character + a 4-digit string (hopefully the year)].
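
If the records are processed one at a time, the same idea can be wrapped in a small helper. The sketch below is a variant under stated assumptions, not the code from this answer: the function name `extractPublisher` is hypothetical, the year test is narrowed to an optional "c" followed by a four-digit number starting with 1 or 2, and records without such a token fall back to the cut at the first (, [ or comma that the question describes as acceptable.

```php
<?php
// Hypothetical helper: returns the text in front of the first year-like token.
function extractPublisher(string $record): string
{
    // Lazy capture, then any separators, an optional "c", and a four-digit year.
    if (preg_match('/^(.+?)\W*c?[12]\d{3}/', $record, $m)) {
        return trim($m[1]);
    }
    // Fallback from the question: cut at the first "(", "[" or ",".
    $parts = preg_split('/[(\[,]/', $record, 2);
    return trim($parts[0]);
}

echo extractPublisher('New York: Bantam Books, c1990. xx, 444 p.'), "\n";
echo extractPublisher('Garden City, N.Y., Doubleday, n.d. 142 p.'), "\n";
```

The first call finds the year and keeps "New York: Bantam Books"; the second record (an invented example with no year) falls through to the fallback and is cut at its first comma.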

Upvotes: 1
