php regex - extract all text before certain characters

Question

I am trying to extract publisher information from a string. It comes in various formats such as:

John Wiley & Sons (1995), Paperback, 154 pages

New York, Crowell [1963] viii, 373 p. illus. 20 cm.

New York: Bantam Books, c1990. xx, 444 p. : ill. ; 27 cm.

Garden City, N.Y., Doubleday, 1963. 142 p. illus. 22 cm. [1st ed.]

All I want to extract is the publisher name, so everything after the ( or the [ can be ignored. I need to grab every character before that point, however. It's complicated by the fact that in example three I'd want the information before the comma, whereas in example two I'd want the information before the square bracket only, keeping that comma if possible.

I'm willing to work with a regex that takes everything before ( [ and , and work with any imperfect data (like only getting "New York" for example 2), since I wouldn't want to insert all of example 3 into the database. The majority of the data have the date in brackets as in examples 1 and 2.

Thanks in advance for any suggestions!

Tomalak · Accepted Answer

Hm, how about replacing:

[^\w\r\n]+c?[12]\d{3}.*

with the empty string? Explanation:

[^\w\r\n]+   # one or more non-word characters (but no new lines either!)
c?           # an optional "c"
[12]\d{3}    # a year (probably, at least)
.*           # all the rest of the line

Works for your examples, but might need a little extra tweaking.
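For illustration, here is a minimal sketch of how the replacement could be applied in PHP with preg_replace(). The helper name extract_publisher() and the trailing trim() are my own assumptions, not something from the question. The four sample strings are the ones from the question.

```php
<?php
// Hypothetical helper (name chosen for this sketch): removes the year and
// everything after it, leaving the publisher part of the line.
function extract_publisher(string $line): string
{
    // [^\w\r\n]+  one or more non-word characters, excluding line breaks
    // c?          an optional "c" (as in "c1990")
    // [12]\d{3}   four digits starting with 1 or 2 (probably a year)
    // .*          the rest of the line
    $result = preg_replace('/[^\w\r\n]+c?[12]\d{3}.*/', '', $line);

    // preg_replace() returns null on error; fall back to the input in that case.
    return trim($result ?? $line);
}

$samples = [
    'John Wiley & Sons (1995), Paperback, 154 pages',
    'New York, Crowell [1963] viii, 373 p. illus. 20 cm.',
    'New York: Bantam Books, c1990. xx, 444 p. : ill. ; 27 cm.',
    'Garden City, N.Y., Doubleday, 1963. 142 p. illus. 22 cm. [1st ed.]',
];

foreach ($samples as $sample) {
    echo extract_publisher($sample), PHP_EOL;
}
// John Wiley & Sons
// New York, Crowell
// New York: Bantam Books
// Garden City, N.Y., Doubleday
```

One thing to watch for: the pattern cuts at the first thing that looks like a year, so a publisher name that itself contains a four-digit number starting with 1 or 2 would be cut short. That is the kind of tweaking that might still be needed.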
