Reputation: 2233
I need to extract all matches from a huge text that start with [" and end with "]. These special characters separate the records that come from a database. I need to extract all records.
Inside each record there are letters, numbers and special characters such as -, ., &, (), /, {space} and so on.
I'm writing this in Office VBA.
The pattern I have come up with so far looks like this: .Pattern = "[[][""][a-z|A-Z|w|W]*".
With this pattern, I am able to extract the first word of each record, together with the starting characters [". The count of found matches is correct.
Example of one record:
["blabla","blabla","blabla","\u00e1no","nie","\u00e1no","\u00e1no","\u00e1no","\u003Ca class=\u0022btn btn-default\u0022 href=\u0022\u0026#x2F;siea\u0026#x2F;suppliers\u0026#x2F;42\u0022\u003E\u003Ci class=\u0022fa fa-pencil\u0022\u003E\u003C\/i\u003E Upravi\u0165\u003C\/a\u003E \u003Ca class=\u0022btn btn-default\u0022 href=\u0022\u0026#x2F;siea\u0026#x2F;suppliers\u0026#x2F;form\u0026#x2F;42\u0022\u003E\u003Ci class=\u0022fa fa-file-pdf-o\u0022\u003E\u003C\/i\u003E Zmluva\u003C\/a\u003E \u003Ca class=\u0022btn btn-default\u0022 href=\u0022\u0026#x2F;siea\u0026#x2F;suppliers\u0026#x2F;crz-form\u0026#x2F;42\u0022\u003E\u003Ci class=\u0022fa fa-file-pdf-o\u0022\u003E\u003C\/i\u003E Zmluva CRZ\u003C\/a\u003E"]
The question is: how can I extract all the records starting with [" and ending with "]?
I don't necessarily need the starting and ending characters; I can clean those up later.
Thanks for help.
Upvotes: 1
Views: 2191
Reputation: 626738
The easiest way is to get rid of the initial [" and the trailing "] with either Replace or the Left/Right/Mid functions, and then Split with "," (in VBA, """,""").
E.g.
s = "YOUR_STRING"
' Strip the trailing "] and the leading [" first
s = Replace(Replace(s, """]", ""), "[""", "")
' Then split on "," to get the individual fields
result = Split(s, """,""")
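The Left/Right/Mid alternative mentioned above can be sketched roughly as follows. This is only an illustration, assuming a single record that is already known to start with [" and end with "]; the names SplitRecord and DemoSplitRecord are made up for this example.

```vba
' Hypothetical helper: splits one record of the form ["f1","f2",...]
' into its fields. Assumes the record starts with [" and ends with "].
Function SplitRecord(ByVal record As String) As Variant
    Dim inner As String
    ' Drop the first two characters ([") and the last two ("])
    inner = Mid$(record, 3, Len(record) - 4)
    ' Fields are separated by the three characters ","
    SplitRecord = Split(inner, """,""")
End Function

Sub DemoSplitRecord()
    Dim fields As Variant, i As Long
    fields = SplitRecord("[""blabla"",""nie"",""\u00e1no""]")
    ' Print each field to the Immediate window
    For i = LBound(fields) To UBound(fields)
        Debug.Print fields(i)
    Next i
End Sub
```

Unlike the nested Replace calls, Mid only trims the two outer delimiters, so a [" or "] that happens to occur inside a field is left untouched.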
If you plan to use a regex, you can use the \["[\s\S]*?"] pattern, but the lazy quantifier expands one character at a time, so it is not efficient with long inputs and may even make the macro hang. You can unroll it as
\["[^"]*(?:"(?!])[^"]*)*"]
In VBA, Pattern = "\[""[^""]*(?:""(?!])[^""]*)*""]"
Note that with this unrolled pattern, you do not even need the workarounds for the dot-not-matching-newline issue (the negated character class [^"] matches any character but ", including a newline).
Pattern details:
\[" - [" literally
[^"]* - zero or more characters other than "
(?:"(?!])[^"]*)* - zero or more sequences of
    "(?!]) - " not followed by ]
    [^"]* - zero or more characters other than "
"] - the literal character sequence "]
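As a rough sketch of how the unrolled pattern might be used to pull every record out of a larger text: the example below assumes late binding to VBScript.RegExp via CreateObject, and the Sub name, variable names and sample input are invented for illustration.

```vba
Sub ExtractRecords()
    Dim re As Object, matches As Object, m As Object
    Dim src As String

    ' Made-up sample input with two records; the second one contains
    ' a ] inside a field to show that it does not end the match early.
    src = "[""a"",""b""],[""c"",""d]e""]"

    Set re = CreateObject("VBScript.RegExp")
    re.Global = True   ' find all matches, not just the first one
    re.Pattern = "\[""[^""]*(?:""(?!])[^""]*)*""]"

    Set matches = re.Execute(src)
    Debug.Print "Records found: " & matches.Count

    For Each m In matches
        ' m.Value is the whole record including [" and "];
        ' Mid$ drops those two leading and two trailing characters.
        Debug.Print Mid$(m.Value, 3, Len(m.Value) - 4)
    Next m
End Sub
```

Setting Global = True is what makes Execute return every record rather than only the first. Each stripped record can then be passed to Split with """,""" to get its fields.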
Upvotes: 1