Reputation: 68
I have a Java string containing XML, and I am trying to use a Java regex to strip out all the markup that sits between the pieces of each bracketed placeholder, e.g. [DEFENDANT], so that each placeholder ends up as one contiguous piece of text.
I want to go from this:
<w:p><w:r><w:t>[</w:t></w:r><st1:PlaceName w:st="on"><w:r><w:t>DEFENDANT</w:t></w:r>
</st1:PlaceName><w:r><w:t> </w:t></w:r><st1:PlaceType w:st="on"><w:r><w:t>CITY</w:t></w:r>
</st1:PlaceType><w:r><w:t>], [</w:t></w:r><st1:place w:st="on"><st1:PlaceName w:st="on"><w:r>
<w:t>DEFENDANT</w:t></w:r></st1:PlaceName><w:r><w:t> </w:t></w:r><st1:PlaceType w:st="on"><w:r>
<w:t>STATE</w:t></w:r></st1:PlaceType></st1:place><w:r><w:t>] [DEFENDANT ZIP]</w:r><w:r>
to this:
<w:p><w:r><w:t>[DEFENDANT CITY], [DEFENDANT STATE] [DEFENDANT ZIP]</w:r><w:r>
I have been testing regex expressions such as (\[)<.+>+([A-Z ]+\])
extensively on RegexPlanet, to no avail.
Upvotes: 0
Views: 467
Reputation: 324
If it's all on a single line, like this:
<w:p><w:r><w:t>[</w:t></w:r><st1:PlaceName w:st="on"><w:r><w:t>DEFENDANT</w:t></w:r></st1:PlaceName><w:r><w:t> </w:t></w:r><st1:PlaceType w:st="on"><w:r><w:t>CITY</w:t></w:r></st1:PlaceType><w:r><w:t>], [</w:t></w:r><st1:place w:st="on"><st1:PlaceName w:st="on"><w:r><w:t>DEFENDANT</w:t></w:r></st1:PlaceName><w:r><w:t> </w:t></w:r><st1:PlaceType w:st="on"><w:r><w:t>STATE</w:t></w:r></st1:PlaceType></st1:place><w:r><w:t>] [DEFENDANT ZIP]</w:r><w:r>
Then this regex should work:
([<\w:\w>]+)(\[[</\w:\w>]+\s\w:\w+="\w+"><\w:\w><\w:\w>)(\w+)(</\w:\w></\w:\w></\w+:\w+><\w:\w><\w:\w>\s</\w:\w></\w:\w><\w+:\w+\s\w:\w+="\w+"><\w:\w><\w:\w>)(\w+)(</\w:\w></\w:\w></\w+:\w+><\w:\w><\w:\w>\],\s\[</\w:\w></\w:\w><\w+:\w+\s\w:\w+="\w+"><\w+:\w+\s\w:\w+="\w+"><\w:\w><\w:\w>)(\w+)(</\w:\w></\w:\w></\w+:\w+><\w:\w><\w:\w>\s</w:\w></\w:\w><\w+:\w+\s\w:\w+="\w+"><\w:\w><\w:\w>)(\w+)(</\w:\w></\w:\w></\w+:\w+></\w+:\w+><\w:\w><\w:\w>\]\s\[)(\w+\s\w+)(\])(</\w:\w><\w:\w>)
I have a working example here: RegExr
I could have grouped things a little better, but overall, it gets the job done, so you should be able to see it working.
If it's not on a single line (that is, if it's split across lines as in your example), then this should work instead:
([<\w:\w>]+)(\[[</\w:\w>]+\s\w:\w+="\w+"><\w:\w><\w:\w>)(\w+)(</\w:\w></\w:\w>\s+</\w+:\w+><\w:\w><\w:\w>\s</\w:\w></\w:\w><\w+:\w+\s\w:\w+="\w+"><\w:\w><\w:\w>)(\w+)(</\w:\w></\w:\w>\s+</\w+:\w+><\w:\w><\w:\w>\],\s\[</\w:\w></\w:\w><\w+:\w+\s\w:\w+="\w+"><\w+:\w+\s\w:\w+="\w+"><\w:\w>\s+<\w:\w>)(\w+)(</\w:\w></\w:\w></\w+:\w+><\w:\w><\w:\w>\s</w:\w></\w:\w><\w+:\w+\s\w:\w+="\w+"><\w:\w>\s+<\w:\w>)(\w+)(</\w:\w></\w:\w></\w+:\w+></\w+:\w+><\w:\w><\w:\w>\]\s\[)(\w+\s\w+)(\])(</\w:\w><\w:\w>)
You can see that on RegExr here.
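The two patterns above are tied to the exact tag sequence in your sample. A shorter alternative, offered only as a sketch: find each span from "[" to the next "]", then delete every tag and line break inside that span. The class name BracketCleaner and the method clean are made up for illustration, and the sketch assumes that no "]" or ">" appears inside an attribute value and that placeholders are never nested.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
class BracketCleaner {
    // From '[' up to the nearest ']'. A negated character class also
    // matches line breaks, so this works for single-line and multi-line input.
    private static final Pattern PLACEHOLDER = Pattern.compile("\\[[^\\]]*\\]");
    // Any tag, opening or closing (assumes no '>' inside attribute values).
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    static String clean(String xml) {
        Matcher m = PLACEHOLDER.matcher(xml);
        StringBuffer out = new StringBuffer(); // StringBuffer works on every Java version
        while (m.find()) {
            // Drop the tags inside this placeholder, then any leftover line breaks.
            String inner = TAG.matcher(m.group()).replaceAll("");
            inner = inner.replaceAll("[\\r\\n]+", "");
            // quoteReplacement keeps '$' and '\' in the text from being treated specially.
            m.appendReplacement(out, Matcher.quoteReplacement(inner));
        }
        m.appendTail(out); // copy everything after the last placeholder unchanged
        return out.toString();
    }

    public static void main(String[] args) {
        String sample = "<w:t>[</w:t><w:t>DEFENDANT</w:t><w:t> </w:t><w:t>CITY</w:t><w:t>]</w:t>";
        System.out.println(clean(sample));
    }
}
```

Everything outside the brackets is left untouched, so the tags that remain are only balanced if the removed spans happened to be balanced, as they are in your sample.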
Upvotes: 0
Reputation: 91871
Do not use regex to parse XML. Just use the built-in Java XML library.
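As a rough sketch of what that could look like with the JDK's DOM parser: parse the paragraph and concatenate the text of every w:t element in document order. This assumes you have the complete, well-formed XML; the fragment pasted in the question is cut off (for example, the last w:t is never closed), so it would not parse as shown. The class name ParagraphText and the method textOf are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of any library.
class ParagraphText {
    // Returns the text of all <w:t> elements, joined in document order.
    static String textOf(String xml) {
        try {
            // The factory is not namespace-aware by default, so "w:t" is
            // matched as a plain tag name and the prefixes need no declaration.
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));

            NodeList textRuns = doc.getElementsByTagName("w:t");
            StringBuilder text = new StringBuilder();
            for (int i = 0; i < textRuns.getLength(); i++) {
                text.append(textRuns.item(i).getTextContent());
            }
            return text.toString();
        } catch (Exception e) {
            // Malformed input ends up here; rethrown unchecked to keep callers simple.
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<w:p><w:r><w:t>[</w:t></w:r><w:r><w:t>DEFENDANT</w:t></w:r>"
                   + "<w:r><w:t> </w:t></w:r><w:r><w:t>CITY</w:t></w:r>"
                   + "<w:r><w:t>]</w:t></w:r></w:p>";
        System.out.println(textOf(xml));
    }
}
```

This yields the joined text only. If the result has to go back into the document, you would still need to write it into a single w:t and remove the other runs.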
Upvotes: 4