Matt

Reputation: 69

Regexp to read to a plus sign

I am using the regexp below successfully to read the text between my tags, until I hit a case where a < sign is embedded in the data between the tags. To fix this, I want to read between a +> and a </+ instead. That combination never occurs in the database I'm pulling from. When I try to change the code below to do this, I get stuck. Any ideas?

Code:

@fieldValues =  $inFileLine =~ m(>([^<]+)<)g;

My sorry attempt at modifying the code:

@fieldValues =  $inFileLine =~ m(\+>([^<\/\+]+)<\/\+)g;

Data:

<+RecordID+>SWCR000111</+RecordID+><+Title+>My Title Is < Than Yours</+Title+>

Upvotes: 0

Views: 237

Answers (2)

user557597

Update: since you are only looking for something simple, you don't need to
go beyond the definition of the tag delimiters.
That is because you are not really parsing with a definition of a tag at all.

The solution boils down to this very simple regex -

Find: <(?!/?\+)
Replace: &lt;
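
As a rough sketch of how that find/replace could be applied in Perl, using the sample line and variable names from the question: every < that does not start a <+ or </+ tag is escaped first, and the question's original pattern is then run unchanged. The $escaped variable and the final step that turns &lt; back into < are additions made for illustration only.

```perl
use strict;
use warnings;

my $inFileLine = '<+RecordID+>SWCR000111</+RecordID+><+Title+>My Title Is < Than Yours</+Title+>';

# Escape every < that is not the start of <+ or </+ (work on a copy).
(my $escaped = $inFileLine) =~ s{<(?!/?\+)}{&lt;}g;

# The original pattern from the question now works as it did before.
my @fieldValues = $escaped =~ m(>([^<]+)<)g;

# Optional (assumption): restore the literal < in each extracted value.
s/&lt;/</g for @fieldValues;

print "$_\n" for @fieldValues;
# SWCR000111
# My Title Is < Than Yours
```

Note that the last step would also convert any &lt; that was already present in the data, so skip it if that matters.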


If you still want to proceed on the assumption that +> .. </+ delimits
the data between tags, this is the original answer.


Typically it's done with negative lookahead assertions on a character-by-character basis.

m{\+>((?:(?!\+>|</\+).)*<(?:(?!\+>|</\+).)*)</\+}s

Formatted:

 \+>
 (                             # (1 start)
      (?:
           (?! \+> | </\+ )
           . 
      )*
      <
      (?:
           (?! \+> | </\+ )
           . 
      )*
 )                             # (1 end)
 </\+

Output:

 **  Grp 0 -  ( pos 42 , len 29 ) 
+>My Title Is < Than Yours</+  
 **  Grp 1 -  ( pos 44 , len 24 ) 
My Title Is < Than Yours  
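
A short Perl sketch of the regex above, run against the sample line from the question. Because the pattern contains a mandatory < inside group 1, it captures only fields whose value contains a < character, which is why the output above lists just the title. The second pattern, with that mandatory < dropped, is a variation added here as an assumption in case every field is wanted; the variable names are made up for the example.

```perl
use strict;
use warnings;

my $inFileLine = '<+RecordID+>SWCR000111</+RecordID+><+Title+>My Title Is < Than Yours</+Title+>';

# The regex as posted: group 1 must contain at least one <.
my @with_lt = $inFileLine =~ m{\+>((?:(?!\+>|</\+).)*<(?:(?!\+>|</\+).)*)</\+}gs;

# Variation (assumption): same tempered dot, without the required <.
my @all_fields = $inFileLine =~ m{\+>((?:(?!\+>|</\+).)*)</\+}gs;

print "with <: $_\n" for @with_lt;     # with <: My Title Is < Than Yours
print "all:    $_\n" for @all_fields;  # all:    SWCR000111
                                       # all:    My Title Is < Than Yours
```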

Upvotes: 0

Wiktor Stribiżew

Reputation: 627100

Since this works for you, given that the opening +> cannot be followed by <+, I am posting my comment as an answer.

This regex should be safe to use even with very large files:

\+>(?!<\+)([^<]*(?:<(?!\/\+)[^<]*)*)<\/\+


Here is what it is doing:

  • \+>(?!<\+) - matches +> (with \+>) that is not followed with <+ (due to the negative lookahead (?!<\+))
  • ([^<]*(?:<(?!\/\+)[^<]*)*) - matches and stores in Group 1
    • [^<]* - 0 or more characters other than < followed by...
    • (?:<(?!\/\+)[^<]*)* - 0 or more sequences of...
      • <(?!\/\+) - < that is not followed by /+ and then
      • [^<]* - 0 or more characters other than <
  • <\/\+ - matches the final </+
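
The breakdown above can be tried out with a small Perl sketch along these lines. It uses the sample line and variable names from the question and the /x flag so the pattern can be laid out with comments; the layout itself is only for illustration.

```perl
use strict;
use warnings;

my $inFileLine = '<+RecordID+>SWCR000111</+RecordID+><+Title+>My Title Is < Than Yours</+Title+>';

my @fieldValues = $inFileLine =~ m{
    \+> (?!<\+)                  # a +> that is not directly followed by <+
    (                            # group 1: the field value
        [^<]*                    #   any run of characters other than <
        (?: < (?!/\+) [^<]* )*   #   a < not followed by /+, then more non-< characters
    )
    </\+                         # the closing </+
}gx;

print "$_\n" for @fieldValues;
# SWCR000111
# My Title Is < Than Yours
```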

In short, this is the same as \+>(?!<\+)([\s\S]*?)<\/\+, but "unwrapped" using the unrolling-the-loop technique to allow large portions of text in-between the delimiters (that is, between +> and the closest </+).
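
As a quick sanity check of that equivalence, the following Perl sketch runs the lazy form and the unrolled form on the sample line from the question and compares what they capture. It only checks this one input and says nothing about speed; the variable names are made up for the example.

```perl
use strict;
use warnings;

my $inFileLine = '<+RecordID+>SWCR000111</+RecordID+><+Title+>My Title Is < Than Yours</+Title+>';

# Lazy form: take as little as possible up to the closest </+.
my @lazy     = $inFileLine =~ m{\+>(?!<\+)([\s\S]*?)</\+}g;

# Unrolled form from the answer.
my @unrolled = $inFileLine =~ m{\+>(?!<\+)([^<]*(?:<(?!/\+)[^<]*)*)</\+}g;

print "lazy:     $_\n" for @lazy;
print "unrolled: $_\n" for @unrolled;
print "same captures\n" if "@lazy" eq "@unrolled";
```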

Upvotes: 1
