Reputation: 181
I am trying to extract publisher information from a string. It comes in various formats such as:
John Wiley & Sons (1995), Paperback, 154 pages
New York, Crowell [1963] viii, 373 p. illus. 20 cm.
New York: Bantam Books, c1990. xx, 444 p. : ill. ; 27 cm.
Garden City, N.Y., Doubleday, 1963. 142 p. illus. 22 cm. [1st ed.]
All I want to extract is the publisher name, so everything from the ( or the [ onward can be ignored, and I need to grab every character before it. It's complicated by the fact that in example three I'd want to grab the information before the comma, but in example two I'd want to grab the information before the square bracket only, keeping that comma if possible.
I'm willing to work with a regex that takes everything before the first (, [ or comma and to live with imperfect data (like only getting "New York" for example 2), since I wouldn't want to insert all of example 3 into the database. The majority of the data have the date in brackets, as in examples 1 and 2.
Thanks in advance for any suggestions!
Upvotes: 0
Views: 2203
Reputation: 338148
Hm, how about replacing:
[^\w\n\r]+c?[12]\d{3}.*
with the empty string? Explanation:
[^\w\n\r]+ # one or more non-word characters (but no line breaks!)
c? # an optional "c"
[12]\d{3} # a year (probably, at least)
.* # all the rest of the line
Works for your examples, but might need a little extra tweaking for the rest of your data.
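As a minimal sketch of the replacement described above, here it is in Python (the language is my choice for illustration; the pattern itself should behave the same in most regex flavors). The names `TRAILER`, `strip_after_year` and `samples` are hypothetical, and the sample lines are the four from the question.

```python
import re

# Pattern from the answer: separator characters, an optional "c",
# a four-digit year starting with 1 or 2, then the rest of the line.
TRAILER = re.compile(r"[^\w\n\r]+c?[12]\d{3}.*")

def strip_after_year(line: str) -> str:
    """Remove the year and everything after it (hypothetical helper)."""
    return TRAILER.sub("", line)

# The four example lines from the question.
samples = [
    "John Wiley & Sons (1995), Paperback, 154 pages",
    "New York, Crowell [1963] viii, 373 p. illus. 20 cm.",
    "New York: Bantam Books, c1990. xx, 444 p. : ill. ; 27 cm.",
    "Garden City, N.Y., Doubleday, 1963. 142 p. illus. 22 cm. [1st ed.]",
]

cleaned = [strip_after_year(s) for s in samples]
for publisher in cleaned:
    print(publisher)
```

A line that contains no year is returned unchanged, so such rows would still need separate handling.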
Upvotes: 2
Reputation: 23783
Here is one: #(.+?)\W*.\d{4}#
Usage:
preg_match_all('#(.+?)\W*.\d{4}#', $books, $matches);
$publishers = array_map('trim', $matches[1]);
print_r($publishers);
Generates (as seen on ideone):
Array
(
    [0] => John Wiley & Sons
    [1] => New York, Crowell
    [2] => New York: Bantam Books
    [3] => Garden City, N.Y., Doubleday
)
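It basically extracts everything before the sequence [any number of non-word characters + one arbitrary character + a four-digit string (hopefully the year)]. The single arbitrary character is what absorbs the "(", "[" or "c" directly in front of the year.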
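For anyone not using PHP, the same extraction can be sketched in Python with `re.findall`, which returns the contents of the single capture group for each match. This is an illustrative translation, not part of the original answer; the variable names `books` and `publishers` simply mirror the PHP snippet, and the input is assumed to be the four sample lines joined by newlines.

```python
import re

# Assumed input: the four example lines from the question, one per line.
books = "\n".join([
    "John Wiley & Sons (1995), Paperback, 154 pages",
    "New York, Crowell [1963] viii, 373 p. illus. 20 cm.",
    "New York: Bantam Books, c1990. xx, 444 p. : ill. ; 27 cm.",
    "Garden City, N.Y., Doubleday, 1963. 142 p. illus. 22 cm. [1st ed.]",
])

# (.+?)  lazily captures the publisher part (never crosses a line break)
# \W*    skips any separator characters before the year
# .      consumes one more character, e.g. "(", "[", "c" or a space
# \d{4}  the four-digit year
matches = re.findall(r"(.+?)\W*.\d{4}", books)

# Trim stray whitespace, like array_map('trim', ...) in the PHP version.
publishers = [m.strip() for m in matches]
print(publishers)
```

Lines without any four-digit number produce no match at all here, so they would silently drop out of the result rather than come back unchanged.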
Upvotes: 1