kolcinx

Reputation: 2233

Regular Expression starting and ending with special characters

I need to extract all matches from a huge text that start with [" and end with "]. These special characters delimit each record exported from a database, and I need to extract every record.

Inside each record there are letters, numbers and special characters such as -, ., &, (), /, {space} and so on.
I'm writing this in Office VBA.
The pattern I have come up with so far looks like this: .Pattern = "[[][""][a-z|A-Z|w|W]*".
With this pattern, I am able to extract the first word from each record, with the starting characters [". The count of found matches is correct.
Example of one record: ["blabla","blabla","blabla","\u00e1no","nie","\u00e1no","\u00e1no","\u00e1no","\u003Ca class=\u0022btn btn-default\u0022 href=\u0022\u0026#x2F;siea\u0026#x2F;suppliers\u0026#x2F;42\u0022\u003E\u003Ci class=\u0022fa fa-pencil\u0022\u003E\u003C\/i\u003E Upravi\u0165\u003C\/a\u003E \u003Ca class=\u0022btn btn-default\u0022 href=\u0022\u0026#x2F;siea\u0026#x2F;suppliers\u0026#x2F;form\u0026#x2F;42\u0022\u003E\u003Ci class=\u0022fa fa-file-pdf-o\u0022\u003E\u003C\/i\u003E Zmluva\u003C\/a\u003E \u003Ca class=\u0022btn btn-default\u0022 href=\u0022\u0026#x2F;siea\u0026#x2F;suppliers\u0026#x2F;crz-form\u0026#x2F;42\u0022\u003E\u003Ci class=\u0022fa fa-file-pdf-o\u0022\u003E\u003C\/i\u003E Zmluva CRZ\u003C\/a\u003E"]

The question is: how can I extract all records starting with [" and ending with "]?
I don't necessarily need the starting and ending characters in the result; I can clean those up later.
Thanks for help.

Upvotes: 1

Views: 2191

Answers (1)

Wiktor Stribiżew

Reputation: 626738

The easiest way is to get rid of the initial and trailing [" and "] with either Replace or Left/Right/Mid functions, and then Split with "," (in VBA, """,""").

E.g.

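' Input is a reserved keyword in VBA, so the variable is named s here
s = "YOUR_STRING"
' Strip the record delimiters [" and "]
s = Replace(Replace(s, """]", ""), "[""", "")
' Split the remaining text into fields on ","
result = Split(s, """,""")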
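
The Left/Right/Mid variant mentioned above could look roughly like the sketch below. SplitRecord is a hypothetical helper name, not part of VBA, and the code assumes it is handed exactly one record at a time; it strips only the outermost [" and "] instead of every occurrence in the string.

```vba
' Hypothetical helper: strips the leading [" and trailing "] from ONE record
' with Left$/Right$/Mid$, then splits the fields on ",".
Function SplitRecord(ByVal record As String) As Variant
    ' Only strip when both delimiters are actually present
    If Len(record) >= 4 Then
        If Left$(record, 2) = "[""" And Right$(record, 2) = """]" Then
            ' Drop the first two and the last two characters
            record = Mid$(record, 3, Len(record) - 4)
        End If
    End If
    ' Fields are separated by the three characters ","
    SplitRecord = Split(record, """,""")
End Function
```

For example, SplitRecord("[""a"",""b""]") returns an array whose elements are a and b.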

If you plan to use a regex, you can use the \["[\s\S]*?"] pattern, but it is not very efficient with long inputs and may even freeze the macro. You can unroll it as

\["[^"]*(?:"(?!])[^"]*)*"]

In VBA, where each double quote inside a string literal is doubled, this becomes Pattern = "\[""[^""]*(?:""(?!])[^""]*)*""]"

Note that with this unrolled pattern you do not need any workaround for the dot not matching newlines: the negated character class [^"] matches any character except ", including a newline.

Pattern details:

  • \[" - [" literally
  • [^"]* - zero or more characters other than "
  • (?:"(?!])[^"]*)* - zero or more sequences of
    • "(?!]) - " not followed by ]
    • [^"]* - zero or more characters other than "
  • "] - literal character sequence "]
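
As a rough illustration of how the unrolled pattern might be applied to the whole text, here is a minimal sketch. It assumes late binding to VBScript.RegExp (available on Windows), and ExtractRecords is a hypothetical function name chosen for this example.

```vba
' Sketch: collect every ["..."] record from a long string.
' Assumes late-bound VBScript.RegExp; ExtractRecords is a hypothetical name.
Function ExtractRecords(ByVal source As String) As Collection
    Dim re As Object
    Dim m As Object
    Dim results As New Collection

    Set re = CreateObject("VBScript.RegExp")
    re.Global = True   ' return all matches, not just the first one
    re.Pattern = "\[""[^""]*(?:""(?!])[^""]*)*""]"

    For Each m In re.Execute(source)
        results.Add m.Value   ' each value still includes the surrounding [" and "]
    Next m

    Set ExtractRecords = results
End Function
```

Calling ExtractRecords(hugeText).Count should then give the number of records found, and each item can be cleaned up afterwards with Mid or Replace if the delimiters are not wanted.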

Upvotes: 1
