Reputation: 4695
I have a program that parses various file formats to find localizable strings (pretty much GetText). I'm looking for a regex that extracts "TEXT TO TRANSLATE" from within a specific opening and closing tag. I had a working regex, but the following example broke it because of the IsVisible call.
<mw:Translate runat="server" Visible='<%# IsVisible() %>'>
TEXT TO TRANSLATE
</mw:Translate>
This is what I have so far, but I got stuck. Any help? I have described what each part is meant to do (wrongly, as it turns out) in the // comments:
(?s) //single-line flag: . also matches newlines
\<mw\:Translate //opening <mw:Translate> tag
(?:(?![^"']+\s*\>)+) //match anything but > preceded by " or '
//with any whitespace after it
(?:["']+\s*)\> //match > preceded by " or ' with any
//whitespace after it
\s* //match any whitespace
//(for trimming any whitespace around the text)
(?<text>.*?) //capturing group for the localizable text
\s* //match any whitespace
\</mw\:Translate\> //match closing tag
The problem is probably in the opening tag. I'm trying to match the closing bracket > only when it is preceded by " or ', with optional whitespace in between, because otherwise it's either something like %> or it's not valid ASP.NET.
EDIT 1: Please read the question before jumping to conclusions. This is not HTML but ASP.NET, which cannot be parsed well by HTML parsers. I'm also targeting something very specific. Correction: people seem to agree it can be parsed with HtmlAgilityPack, but I'd rather not use it, because I don't want to rely on an external library for one simple use case.
EDIT 2: mw:Translate cannot be nested. It simply won't compile because of how the mw:Translate is programmed.
EDIT 3: Clarification of edits.
EDIT 4: Self-closing mw:Translate is not permitted.
EDIT 5: HTML inside mw:Translate is as valid as any other text on an ASP.NET page.
EDIT 6: Answered it myself; the regex I need may be a bit more complicated (but not because of any relation to HTML), see below.
Upvotes: 1
Views: 673
Reputation: 32807
Even if you modify your regex, there are still inputs that will cause problems, for example:
<a href=''/>
Use HtmlAgilityPack instead.
You can use this code to retrieve the text with HtmlAgilityPack:
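// Requires: using System.Linq; using HtmlAgilityPack;
HtmlDocument doc = new HtmlDocument();
doc.Load(yourStream);
// HtmlAgilityPack stores element names lowercased with the prefix included,
// so look the tag up by its full name instead of the XPath "//Translate"
var itemList = doc.DocumentNode.Descendants("mw:translate")
    .Select(p => p.InnerText)
    .ToList();
// itemList now contains the content of all the translate tags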
Upvotes: 1
Reputation: 460208
Even if this is ASP.NET and not HTML, you can use HtmlAgilityPack to parse it.
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html); // html is the aspx document text
var translatableTextNodes =
doc.DocumentNode.SelectNodes("//text()[contains(., 'TEXT TO TRANSLATE')]");
foreach (var parent in translatableTextNodes)
Console.WriteLine("Node:[{0}] Text:{1}",parent.Name, parent.InnerText);
Output with a sample page containing one of your server controls with TEXT TO TRANSLATE:
Node:[mw:translate] Text:
TEXT TO TRANSLATE
Upvotes: 3
Reputation: 138097
I'd try matching the list of attributes, assuming each attribute value is wrapped in double or single quotes.
This is an assumption that isn't correct for all HTML, but it may work for you:
<mw:Translate #opening <mw:Translate> tag
# Match attributes
(?:\s+\w+(?:\s*=\s*(?:"[^"]*"|'[^']*'))?)*
\s*
> #match >
\s*
(?<text>.*?) #capturing group for the localizable text
\s* #match any whitespace
</mw:Translate> #match closing tag
Working example: http://regexhero.net/tester/?id=5834b4f1-095b-4af6-a0da-d1fe119778bc
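For completeness, here is a minimal sketch of how the pattern above could be used from C#. The class name TranslateExtractor and the method Extract are made up for illustration. The pattern relies on RegexOptions.IgnorePatternWhitespace (so the layout and # comments are ignored) and RegexOptions.Singleline (so . also matches newlines inside the tag body).

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library: wraps the attribute-matching
// pattern so it can be applied to the text of an .aspx page.
public static class TranslateExtractor
{
    private static readonly Regex TranslateTag = new Regex(@"
        <mw:Translate                                   # opening tag name
        (?:\s+\w+(?:\s*=\s*(?:""[^""]*""|'[^']*'))?)*   # attributes with quoted values
        \s*>                                            # end of the opening tag
        \s*                                             # trim leading whitespace
        (?<text>.*?)                                    # the localizable text
        \s*                                             # trim trailing whitespace
        </mw:Translate>                                 # closing tag",
        RegexOptions.IgnorePatternWhitespace | RegexOptions.Singleline);

    // Returns the trimmed body of every <mw:Translate> element found in the input.
    public static string[] Extract(string aspx)
    {
        MatchCollection matches = TranslateTag.Matches(aspx);
        var result = new string[matches.Count];
        for (int i = 0; i < matches.Count; i++)
        {
            result[i] = matches[i].Groups["text"].Value;
        }
        return result;
    }
}
```

Calling TranslateExtractor.Extract on the example from the question should return a single string, TEXT TO TRANSLATE, because the > inside '<%# IsVisible() %>' is consumed as part of the quoted attribute value. Unquoted attribute values are still not handled, as noted above.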
Upvotes: 0