Miroslav

Reputation: 4695

Regex parsing of ASP.NET tags

I have a program that parses various file formats to find localizable strings (much like GetText). I'm looking for a regex that captures "TEXT TO TRANSLATE" from between a specific opening and closing tag. I had a working regex, but the following example broke it because of the IsVisible call.

<mw:Translate runat="server" Visible='<%# IsVisible() %>'>
TEXT TO TRANSLATE
</mw:Translate>

This is what I have so far, but I'm stuck. Any help? I have described what I intended each part to do (wrongly, it seems) in the // comments.

(?s)                   //single-line flag: . also matches newlines

\<mw\:Translate        //opening <mw:Translate> tag

(?:(?![^"']+\s*\>)+)   //match anything but > preceded by " or ' 
                       //with any whitespace after it
(?:["']+\s*)\>         //match > preceded by " or ' with any 
                       //whitespace after it

\s*                    //match any whitespace 
                       //(for trimming any whitespace around the text)
(?<text>.*?)           //capturing group for the localizable text
\s*                    //match any whitespace 

\</mw\:Translate\>     //match closing tag

The problem is probably in the opening tag. I'm trying to match the closing bracket > only when it is preceded by " or ', with optional whitespace in between, because otherwise it's either part of something like %> or it's not valid ASP.NET.

EDIT 1: Please read the question before coming to conclusions. This is not HTML but ASP.NET markup, which HTML parsers cannot be expected to parse well. I'm also targeting something very specific. Correction: people seem to agree it can be parsed with HtmlAgilityPack, but I'd rather not use it, because I don't want to rely on an external library for one simple use case.

EDIT 2: mw:Translate cannot be nested. It simply won't compile because of how the mw:Translate is programmed.

EDIT 3: Clarification of edits.

EDIT 4: A self-closing mw:Translate is not permitted.

EDIT 5: HTML inside mw:Translate is as valid as any other text on an ASP.NET page.

EDIT 6: Answered it myself. The regex I need may be a bit more complicated (though not because of any relation to HTML), see below.

Upvotes: 1

Views: 673

Answers (3)

Anirudha

Reputation: 32807

Even if you modify your regex, there are still some problems:

  • It won't work if there are other tags inside (next to impossible to solve with regex).
  • ASP.NET can have self-closing tags like <a href=''/>.

Use HtmlAgilityPack.

You can use this code to retrieve it using HtmlAgilityPack

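// requires: using System.Linq; using HtmlAgilityPack;
HtmlDocument doc = new HtmlDocument();
doc.Load(yourStream);

// HtmlAgilityPack lowercases element names, and a bare "mw:" prefix in XPath
// would be read as a namespace, so match on name() instead.
// Note: SelectNodes returns null when nothing matches.
var itemList = doc.DocumentNode.SelectNodes("//*[name()='mw:translate']")
                  .Select(p => p.InnerText)
                  .ToList();

// itemList now contains the content of all the Translate tags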

Upvotes: 1

Tim Schmelter

Reputation: 460208

Even though this is ASP.NET and not HTML, you can use HtmlAgilityPack to parse it.

var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html); // html is the aspx document text
var translatableTextNodes = 
    doc.DocumentNode.SelectNodes("//text()[contains(., 'TEXT TO TRANSLATE')]");
// each match is a text node; its parent is the enclosing server control
foreach (var textNode in translatableTextNodes)
    Console.WriteLine("Node:[{0}] Text:{1}", textNode.ParentNode.Name, textNode.InnerText);

Output for a sample page that contains one of your server controls with TEXT TO TRANSLATE inside:

Node:[mw:translate] Text:
TEXT TO TRANSLATE

Upvotes: 3

Kobi

Reputation: 138097

I'd try matching the list of attributes, assuming each attribute value is wrapped in double or single quotes.
This assumption doesn't hold for all HTML, but it may work for you:

<mw:Translate       #opening <mw:Translate> tag
# Match attributes
(?:\s+\w+(?:\s*=\s*(?:"[^"]*"|'[^']*'))?)*
\s*
>                   #match >
\s*
(?<text>.*?)        #capturing group for the localizable text
\s*                 #match any whitespace 
</mw:Translate>     #match closing tag

Working example: http://regexhero.net/tester/?id=5834b4f1-095b-4af6-a0da-d1fe119778bc
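
For reference, here is a minimal sketch of how the pattern above might be used from C#. It assumes the regex is compiled with RegexOptions.IgnorePatternWhitespace (so the layout and # comments are ignored) and RegexOptions.Singleline (so . also matches newlines inside the tag body). The Program class and the ExtractTexts helper are names made up for this illustration, not part of the original code.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class Program
{
    // Same pattern as above, written as a verbatim string ("" is an escaped quote).
    // IgnorePatternWhitespace allows the layout and # comments;
    // Singleline lets . match newlines inside the tag body.
    private static readonly Regex TranslateTag = new Regex(@"
        <mw:Translate                                   # opening tag name
        (?:\s+\w+(?:\s*=\s*(?:""[^""]*""|'[^']*'))?)*   # attributes with quoted values
        \s*>                                            # end of the opening tag
        \s*
        (?<text>.*?)                                    # localizable text (lazy)
        \s*
        </mw:Translate>                                 # closing tag",
        RegexOptions.IgnorePatternWhitespace | RegexOptions.Singleline);

    // Hypothetical helper: returns the trimmed body of every mw:Translate tag.
    public static List<string> ExtractTexts(string source)
    {
        var texts = new List<string>();
        foreach (Match m in TranslateTag.Matches(source))
            texts.Add(m.Groups["text"].Value);
        return texts;
    }

    public static void Main()
    {
        // The sample from the question, including the data-binding expression
        string sample = @"<mw:Translate runat=""server"" Visible='<%# IsVisible() %>'>
TEXT TO TRANSLATE
</mw:Translate>";

        foreach (string text in ExtractTexts(sample))
            Console.WriteLine(text); // prints: TEXT TO TRANSLATE
    }
}
```

Because the > inside '<%# IsVisible() %>' is consumed as part of a quoted attribute value, it is never mistaken for the end of the opening tag. An attribute value that itself contains the same kind of quote it is wrapped in would still break this sketch.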

Upvotes: 0
