Reputation: 2419
I have to extract three things from a string: the PO box, the street address, and everything else.
Here's how the string would look:
DUNHOUR AGENCY INC PO BOX 48 44 TANNER STREET HADDONFIELD NJ 08033 VERONA NJ 070440324
I have managed to extract the PO box and the street address using the following regex, but I have been going in circles trying to get the remaining part of the string.
Here's my regex:
\b(PO BOX \d{2,5}|PO Box \d{2,5}|P.O. BOX \d{2,5}|P O BOX \d{2,5})?\s*(\d+\s[A-z]+\s[A-z]+)\s(\d+\s[A-z]+)?
How can I get everything else as the last group match?
I should also be able to extract the rest of the data when the PO box information is missing, for example:
*BENNETTI-HOLMES INSURANCE 43 VOSHELL MILL ROAD DOVER DE 19904
Here I should get false for the PO box, the street address in its own group, and everything else in the last group match.
Upvotes: 0
Views: 59
Reputation: 163362
A few minor notes about the pattern in your posted answer:
[A-z] matches more than [A-Za-z]; the range also covers the characters [ \ ] ^ _ and ` that sit between Z and a in ASCII.
(\sROAD|STREET|AVENUE|DRIVE|RD|ST|AV|DR)? is optional, so your pattern will also match if you leave it out.
In \sROAD the whitespace char will match only before ROAD and will not apply to the other alternatives.
\s might also match a newline.
Escape the dot as \. to match it literally.
You might update the pattern to
\b((?:P ?O|P\.O\.) B(?:ox|OX)\s*\d{2,5})?\s*(\d+\s[A-Za-z]+(?:\s[A-Za-z]+)*\s(?:ROAD|STREET|AVENUE|DRIVE|RD|ST|AV|DR))\s(.{0,100})
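As a quick check, here is a minimal sketch that applies this pattern to the two sample strings from the question. It assumes Python's re module (other regex flavors may behave slightly differently), and the function name split_address is only an illustrative choice.

```python
import re

# The updated pattern, copied as-is; split_address is a hypothetical helper name.
ADDRESS_RE = re.compile(
    r"\b((?:P ?O|P\.O\.) B(?:ox|OX)\s*\d{2,5})?\s*"
    r"(\d+\s[A-Za-z]+(?:\s[A-Za-z]+)*\s(?:ROAD|STREET|AVENUE|DRIVE|RD|ST|AV|DR))"
    r"\s(.{0,100})"
)


def split_address(text):
    """Return (po_box, street, rest), or None when no street address is found.

    po_box is None when the optional group 1 did not take part in the match.
    """
    match = ADDRESS_RE.search(text)
    if match is None:
        return None
    return match.group(1), match.group(2), match.group(3)


if __name__ == "__main__":
    samples = [
        "DUNHOUR AGENCY INC PO BOX 48 44 TANNER STREET HADDONFIELD NJ 08033 VERONA NJ 070440324",
        "*BENNETTI-HOLMES INSURANCE 43 VOSHELL MILL ROAD DOVER DE 19904",
    ]
    for sample in samples:
        print(split_address(sample))
```

With the second sample the first element comes back as None, which can stand in for the "false" the question asks for. Note that .{0,100} caps the last group at 100 characters, so a longer tail would be cut off.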
The updated pattern in separate parts:
\b
  Word boundary
(
  Capture group 1
(?:P ?O|P\.O\.)
  Match the variants PO, P O or P.O.
B(?:ox|OX)
  Match either Box or BOX (preceded by a literal space in the pattern)
\s*\d{2,5}
  Match 0+ whitespace chars and 2-5 digits
)?
  Close group 1 and make it optional
\s*
  Match 0+ whitespace chars
(
  Capture group 2
\d+\s[A-Za-z]+
  Match 1+ digits, a whitespace char and 1+ chars A-Za-z
(?:\s[A-Za-z]+)*
  Repeat 0+ times matching a whitespace char and 1+ times A-Za-z
\s(?:ROAD|STREET|AVENUE|DRIVE|RD|ST|AV|DR)
  Match a whitespace char and one of the alternatives
)
  Close group 2
\s
  Match a whitespace char
(.{0,100})
  Capture group 3, match any char except a newline 0-100 times
Upvotes: 2
Reputation: 2419
I finally managed to get it done using:
\b(PO BOX \d{2,5}|PO Box \d{2,5}|P.O. BOX \d{2,5}|P O BOX \d{2,5})?\s*(\d+\s[A-z]+\s[A-z]+\b(\sROAD|STREET|AVENUE|DRIVE|RD|ST|AV|DR)?)\s+(.{0,100})
Upvotes: 0