Patrick Nogueira

Reputation: 184

How do I get the Named Groups of a Regex in Delphi?

I'm trying to use a regex in Delphi to parse some HTML and extract data.

My objective is to create a query string with the following syntax:

?namedGroup1=valueNamedGroup1&namedGroup2=valueNamedGroup2

I have an array of regexes:

array[0] = '<div (id="(?<id>[a-zA-Z0-9]+)"|name="(?<name>[a-zA-Z0-9]+))"';

My html:

<h1>bla bla bla</h1> <div id="home">

If I apply this regex using the built-in regex functions in PHP, they return an associative array:

RegArray[0] = '<div id="home">'
RegArray['id'] = 'home'

If I do a foreach, I easily get the list of the named groups and can create my query string:

?id=home

The thing is that I don't know in advance whether the regex will match the named group id or name, and I need to know which.

Delphi only returns a simple array:

RegArray[0] = '<div id="home">'
RegArray[1] = 'home'  // ID or NAME?

So, how do I get both the name of the matched group and its value?

Here is my code:

var
  RegEx: TRegEx;
  Match: TMatch;
begin
  RegEx := TRegEx.Create(array[0], [roIgnoreCase, roMultiline]);
  Match := RegEx.Match(html);
  if Match.Success then
  begin
    // get the group here.
  end;
end;

I also tried this class: http://www.regular-expressions.info/delphi.html

But with no success.

Upvotes: 4

Views: 3185

Answers (3)

maf-soft

Reputation: 2552

TRegEx (from System.RegularExpressions) is a wrapper around TPerlRegEx (from System.RegularExpressionsCore), which is a wrapper around the open source PCRE library.

PCRE itself supports retrieving the names of groups, but neither wrapper exposes this.

Possible solutions:

  • Ask Embarcadero to fix it
  • Access PCRE directly (System.RegularExpressionsAPI)
  • Use one of the two wrappers, but for retrieving the names, hack into their private members to get access to the PCRE memory (pcre_fullinfo(TPerlRegEx.FPattern, ...))
  • Use a better wrapper, e.g. JclPCRE from the open source JEDI Code Library (JCL): Name1 := TJclRegEx.CaptureNames[1];

Upvotes: 1

Arioch 'The

Reputation: 16065

I think you made a mistake in your pattern: look at its last two characters. It is clearly unbalanced! Looks like you failed to copy-paste from PHP ;-)

  • yours: <div (id="(?<id>[a-zA-Z0-9]+)"|name="(?<name>[a-zA-Z0-9]+))"
  • mine: <div (id="(?<id>[a-zA-Z0-9]+)"|name="(?<name>[a-zA-Z0-9]+)")
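To check the difference without pen and paper, both patterns can be run against the sample HTML from the question. This is only a minimal sketch, assuming a Delphi version that ships System.RegularExpressions under that unit scope name (older releases use plain RegularExpressions); the program and constant names are made up for illustration.

```delphi
program PatternCheck;

{$APPTYPE CONSOLE}

uses
  System.RegularExpressions;

const
  // Sample HTML and both patterns, copied from the question and from this answer.
  cHtml  = '<h1>bla bla bla</h1> <div id="home">';
  cYours = '<div (id="(?<id>[a-zA-Z0-9]+)"|name="(?<name>[a-zA-Z0-9]+))"';
  cMine  = '<div (id="(?<id>[a-zA-Z0-9]+)"|name="(?<name>[a-zA-Z0-9]+)")';

begin
  // The original pattern demands a second quote right after id="home",
  // so it is not expected to match the sample.
  Writeln('yours matches: ', TRegEx.IsMatch(cHtml, cYours));
  // The repaired pattern closes the quote inside each alternative,
  // so it is expected to match.
  Writeln('mine matches: ', TRegEx.IsMatch(cHtml, cMine));
end.
```

Note that the original pattern is still a valid regex; it simply describes a different string (one with a stray extra quote after the id value).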

DI RegExp demo

Using pcre.org engine + interactive editor from http://www.yunqa.de/delphi/doku.php/products/regex/index


I also tried this class: http://www.regular-expressions.info/delphi.html

That page immediately shows another interactive editor that could be used to debug your RegEx program: http://www.regexbuddy.com/test.html

I wonder why you didn't try to use it...


Still, I think an HTML parser would be both faster and more reliable. Consider HTML extracts like

 <!-- <p><div name="bla-bla"> ... </div></p> -->

or like

 <img src="...." alt='Press to insert <div id="123"> to you sample text' />

or like

 <DIV ID="my cool id" />

The topic starter made his own answer below, consisting mostly of questions to me.

The problem is not the Regex,

Just count the quotes and parentheses with pen and paper: in which order they are opened and in which order they are closed. Your pattern is ( ... " ... ) .... " and that is unbalanced!

is the Delphi.

Delphi the language has nothing to do with regexps; the libraries and components do. So that claim makes no sense. You may argue that you tested broken libraries, but not that the language itself is at fault.

My regex with PHP works fine,

That would mean that either you have a different regex pattern in PHP (you did not copy the PHP source here) or the "problem is in PHP".

Actually, we have seen neither the Delphi source nor the PHP source.

array[0] = '<div (id="(?<id>[a-zA-Z0-9]+)"|name="(?<name>[a-zA-Z0-9]+))"'; is, I think, not a valid line in either language.

So I don't think the code and patterns in your PHP program and your Delphi program match each other. Show the real code being used.

the thing is that DELPHI doesn't return me

  1. Again, that just does not make sense. Delphi is just a language; it does not know a thing about RegEx.
  2. Just above you saw the screenshot of a Delphi-written program using the PCRE engine: given the repaired pattern, it DOES return both name and value. So the claim is obviously wrong even in a vague sense. Delphi DOES return a <name, value> pair for it.

Also, I can't change the whole system to use a HTML parser, the regex is already working

Then you need to adapt the regex to correctly parse the HTML snippets I showed above.

Upvotes: 2

Uwe Raabe

Reputation: 47768

I am not sure about enumerating named groups, but you can access the group either by its index or by its name:

const
  cRegEx = '<div (id="(?<id>[a-zA-Z0-9]+)"|name="(?<name>[a-zA-Z0-9]+)")';
  cHtml = '<h1>bla bla bla</h1> <div id="home">';
var
  group: TGroup;
  match: TMatch;
  regEx: TRegEx;
begin
  regEx := TRegEx.Create(cRegEx, [roIgnoreCase,roMultiline]);
  match := regEx.Match(cHtml);
  if match.Success then begin
    group := match.Groups['id'];
    Assert(group.Value = 'home');
  end;
end;
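Since the names cannot be enumerated this way, one possible workaround for the query string is to keep the candidate names in a list next to the pattern and probe each of them. The following is only a sketch under assumptions: cNames, TryGetGroup and BuildQuery are hypothetical helpers, not part of the RTL, and the try/except is there because some RTL versions may raise an exception when asked for a named group that did not take part in the match. For the sample input it is meant to build the ?id=home string from the question.

```delphi
program NamedGroupQuery;

{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.RegularExpressions;

const
  cRegEx = '<div (id="(?<id>[a-zA-Z0-9]+)"|name="(?<name>[a-zA-Z0-9]+)")';
  cHtml = '<h1>bla bla bla</h1> <div id="home">';
  // Assumption: the group names are maintained by hand to mirror the pattern.
  cNames: array[0..1] of string = ('id', 'name');

// Hypothetical helper: True plus the captured text if the named group
// took part in the match, False otherwise.
function TryGetGroup(const AMatch: TMatch; const AName: string;
  out AValue: string): Boolean;
var
  group: TGroup;
begin
  AValue := '';
  try
    group := AMatch.Groups[AName];
    // The pattern requires at least one character per group, so an empty
    // capture is treated as "did not participate".
    Result := group.Success and (group.Length > 0);
    if Result then
      AValue := group.Value;
  except
    on E: Exception do
      Result := False; // group lookup failed, treat as not matched
  end;
end;

// Hypothetical helper: joins every matched named group as name=value.
function BuildQuery(const AMatch: TMatch): string;
var
  groupName, groupValue: string;
begin
  Result := '';
  for groupName in cNames do
    if TryGetGroup(AMatch, groupName, groupValue) then
    begin
      if Result = '' then
        Result := '?'
      else
        Result := Result + '&';
      Result := Result + groupName + '=' + groupValue;
    end;
end;

var
  regEx: TRegEx;
  match: TMatch;
begin
  regEx := TRegEx.Create(cRegEx, [roIgnoreCase, roMultiLine]);
  match := regEx.Match(cHtml);
  if match.Success then
    Writeln(BuildQuery(match));
end.
```

The values are not URL-encoded here because the pattern only admits letters and digits; a pattern that captures other characters would need encoding before the string is used as a query.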

Upvotes: 0
