Reputation: 69
I am using the regexp below successfully to read between my tags, until I reach a case where there is a < sign embedded in my data between the tags. To fix this I want to read between a +> and a </+. There is no way that combination would be used in the database I'm pulling from. When I try to change the code below to do this I get stuck. Have any ideas?
Code:
@fieldValues = $inFileLine =~ m(>([^<]+)<)g;
My sorry attempt at modifying the code:
@fieldValues = $inFileLine =~ m(\+>([^<\/\+]+)<\/\+)g;
Data:
<+RecordID+>SWCR000111</+RecordID+><+Title+>My Title Is < Than Yours</+Title+>
Upvotes: 0
Views: 237
Reputation:
Update: Since you are just looking for something simple, you don't have to go beyond the definition of the tag delimiters, because you aren't really parsing with a definition of a tag at all.
The solution boils down to this very simple regex -
Find: <(?!/?\+)
Replace: &lt;
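As a rough sketch of how that find/replace could be applied in Perl: the snippet below escapes every stray < first and then runs the question's original pattern unchanged. It assumes the intended replacement text is the HTML entity &lt; and that the data never contains a literal &lt; of its own; the variable $escaped is just an illustrative name.

```perl
use strict;
use warnings;

# Sample line from the question.
my $inFileLine = '<+RecordID+>SWCR000111</+RecordID+>'
               . '<+Title+>My Title Is < Than Yours</+Title+>';

# Escape every "<" that does not start a "<+" or "</+" delimiter.
# Using "&lt;" as the replacement is an assumption about the intended escape.
(my $escaped = $inFileLine) =~ s{<(?!/?\+)}{&lt;}g;

# The original pattern from the question now works unchanged.
my @fieldValues = $escaped =~ m(>([^<]+)<)g;

# Undo the escape so the values hold the raw text again.
s/&lt;/</g for @fieldValues;

print "$_\n" for @fieldValues;
```

If the source data can already contain &lt;, a different placeholder would be needed, since the final unescape step cannot tell the two apart.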
If you want to proceed with the misconception that +> .. </+ delineates something between tags, this is the original.
Typically it's done with negative assertions on a character-by-character basis.
m{\+>((?:(?!\+>|</\+).)*<(?:(?!\+>|</\+).)*)</\+}s
Formatted:
 \+>
 (                              # (1 start)
      (?:
           (?! \+> | </\+ )
           .
      )*
      <
      (?:
           (?! \+> | </\+ )
           .
      )*
 )                              # (1 end)
 </\+
Output:
** Grp 0 - ( pos 42 , len 29 )
+>My Title Is < Than Yours</+
** Grp 1 - ( pos 44 , len 24 )
My Title Is < Than Yours
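One thing worth noticing in that output: the pattern has a mandatory literal < in the middle, so it only matches fields that actually contain one. The sketch below runs it against the question's sample line and, for comparison, a hypothetical variant ($any_field, not part of the answer) that keeps the same per-character assertion but drops the mandatory <.

```perl
use strict;
use warnings;

my $inFileLine = '<+RecordID+>SWCR000111</+RecordID+>'
               . '<+Title+>My Title Is < Than Yours</+Title+>';

# Pattern from the answer: the literal "<" in the middle is mandatory.
my $with_lt = qr{\+>((?:(?!\+>|</\+).)*<(?:(?!\+>|</\+).)*)</\+}s;

# Hypothetical variant: same per-character assertion, no mandatory "<".
my $any_field = qr{\+>((?:(?!\+>|</\+).)*)</\+}s;

my @only_lt = $inFileLine =~ /$with_lt/g;    # fields containing a "<"
my @all     = $inFileLine =~ /$any_field/g;  # every field

print "with <: $_\n" for @only_lt;
print "any:    $_\n" for @all;
```

With the sample data, the first list holds only the Title value while the second holds both field values.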
Upvotes: 0
Reputation: 627100
Since it works for you, as the +> cannot be followed by <+, I am posting my comment as an answer.
This regex should be safe to use even with very large files:
\+>(?!<\+)([^<]*(?:<(?!\/\+)[^<]*)*)<\/\+
Here is what it is doing:
\+>(?!<\+) - matches +> (with \+>) that is not followed by <+ (due to the negative lookahead (?!<\+))
([^<]*(?:<(?!\/\+)[^<]*)*) - matches and stores in Group 1:
    [^<]* - 0 or more characters other than <, followed by...
    (?:<(?!\/\+)[^<]*)* - 0 or more sequences of...
        <(?!\/\+) - a < that is not followed by /+, and then
        [^<]* - 0 or more characters other than <
<\/\+ - matches the final </+
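For what it's worth, the same breakdown can be written directly into the pattern with Perl's /x modifier. This is only a sketch against the sample line from the question; the name $field is illustrative.

```perl
use strict;
use warnings;

my $inFileLine = '<+RecordID+>SWCR000111</+RecordID+>'
               . '<+Title+>My Title Is < Than Yours</+Title+>';

my $field = qr{
    \+> (?! <\+ )         # "+>" that does not run straight into "<+"
    (                     # group 1: the field value
        [^<]*             #   run of non-"<" characters
        (?:
            < (?! /\+ )   #   a "<" that does not begin "</+"
            [^<]*         #   followed by another run of non-"<"
        )*
    )
    </\+                  # closing "</+"
}x;

# In list context with /g, each match contributes its group 1 capture.
my @fieldValues = $inFileLine =~ /$field/g;

print "$_\n" for @fieldValues;
```

The (?!<\+) check is what stops the +> at the end of </+RecordID+> from starting a match that would swallow the following <+Title+> tag.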
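In short, this is the same as \+>(?!<\+)([\s\S]*?)<\/\+, but "unwrapped" using the unrolling-the-loop technique. The lazy quantifier has to test for the closing delimiter after every single character, whereas the unrolled version consumes whole runs of non-< characters at once, which allows large portions of text in between the delimiters (that is, between +> and the closest </+).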
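A quick way to convince yourself the two forms agree is to run both over a few inputs and compare the captures. The sample lines beyond the first are made up for illustration and are not from the question.

```perl
use strict;
use warnings;

my $unrolled = qr{\+>(?!<\+)([^<]*(?:<(?!/\+)[^<]*)*)</\+};
my $lazy     = qr{\+>(?!<\+)([\s\S]*?)</\+};

my @samples = (
    # Line from the question.
    '<+RecordID+>SWCR000111</+RecordID+><+Title+>My Title Is < Than Yours</+Title+>',
    # Hypothetical inputs: a value spanning two lines, and near-miss delimiters.
    "<+Note+>line one\nline < two</+Note+>",
    '<+A+>a </ b <+ c</+A+>',
);

my $mismatches = 0;
for my $line (@samples) {
    my @u = $line =~ /$unrolled/g;
    my @l = $line =~ /$lazy/g;
    # Join on NUL so differing list boundaries cannot compare equal by accident.
    $mismatches++ unless join("\0", @u) eq join("\0", @l);
}

print "mismatches: $mismatches\n";
```

On these inputs both patterns capture the same values, so the count stays at 0. The difference is in how much work the engine does, not in what gets matched.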
Upvotes: 1