SearchSpace

Reputation: 205

QT C++ QRegularExpression multiple matches

I want to extract information from a QString (HTML content) using regular expressions. I explicitly want to use regex (no parser solutions) and the class QRegularExpression (for several reasons, e.g.: Reasons).

To simplify, here is an equivalent problem.

Constructed source string:

<foo><bar s>INFO1.1</bar> </ qux> <peter></peter><bar e>INFO1.2
</bar><fred></ senseless></fred></ xx><lol></lol></foo><bar s>INFO2.1</bar>
</ nothing><endlessSenselessTags></endlessSenselessTags><rofl>
<bar e>INFO2.2</bar></rofl>

*Note:* There could be more or fewer INFOs and additional senseless tags (e.g. 6 INFOs).

Wanted:

Info1.1 and Info1.2 and Info2.1 and Info2.2 (e.g. in List)

Attempt

1.

QRegularExpression reA(".*<bar [es]>(.*)</bar>.*", QRegularExpression::DotMatchesEverythingOption);

->

INFOa</bar> </ qux> <peter></peter><bar e>INFOb
    </bar><fred></ senseless></fred></ xx><lol></lol></foo><bar s>INFOc</bar>
    </ nothing><endlessSenselessTags></endlessSenselessTags><rofl>
    <bar e>INFOd

2.

QRegularExpression reA("(.*<bar [es]>(.*)</bar>.*)*", QRegularExpression::DotMatchesEverythingOption);

->senseless

Problem: The regex always matches against the whole string. For <bar s>INFO</bar><bar s>INFO</bar> it selects everything from the first <bar s> to the last </bar>. What I want is a match from the first <bar s> to the first </bar>.

With QRegExp there seems to be a solution, but I want to do this with QRegularExpression.

Upvotes: 9

Views: 11414

Answers (2)

kayleeFrye_onDeck

Reputation: 6958

I'm adding a new, similar answer because of the vexing lack of QRegularExpression answers that handle all specified capture groups, and not by name. I just wanted to be able to specify capture groups and get only those results, not the whole kitchen sink. That becomes a problem when blindly grabbing capture group 0, which is what almost all answers on SO do for QRegularExpression with multiple results. This answer returns the results of all specified capture groups in a list, and if no capture groups were specified, it returns capture group 0, i.e. the whole-regex match.

I made this simplified code snippet on Gist, which doesn't directly address this question. The sample app below is a diff of it that does address this specific question.

#include <QCoreApplication>
#include <QRegularExpression>
#include <QRegularExpressionMatch>
#include <QStringList>
#include <iostream>
int main(int argc, char *argv[])
{
    QCoreApplication a(argc, argv);

    QStringList results;
    QRegularExpression this_regex("<bar \\w>(.*?)</bar>");
    QString test_string =   "<foo><bar s>INFO1.1</bar> </ qux> <peter></peter><bar e>INFO1.2\n\
                             </bar><fred></ senseless></fred></ xx><lol></lol></foo><bar s>INFO2.1</bar>\n\
                             </ nothing><endlessSenselessTags></endlessSenselessTags><rofl>\n\
                             <bar e>INFO2.2</bar></rofl>\n";

    if(!this_regex.isValid())
    {
        std::cerr << "Invalid regex pattern: " << this_regex.pattern().toStdString() << std::endl;
        return -2;
    }

    for (int i = 0; i < this_regex.captureCount()+1; ++i)
    {
        // This skips storing capture-group 0 if any capture-groups were actually specified.
        // If they weren't, capture-group 0 will be the only thing returned.    
        if((i!=0) || this_regex.captureCount() < 1)
        {
            QRegularExpressionMatchIterator iterator = this_regex.globalMatch(test_string);    
            while (iterator.hasNext())
            {
                QRegularExpressionMatch match = iterator.next();    
                QString matched = match.captured(i);    
                // Remove this if-check if you want to keep zero-length results
                if(matched.length() > 0){results << matched;}
            }
        }
    }

    if(results.length()==0){return -1;}

    for(int i = 0; i < results.length(); i++)
    {
        std::cout << results.at(i).toStdString() << std::endl;
    }

    return 0;
}

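Output in console (INFO1.2 is not matched because its content contains a line break, which . does not match by default):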

 INFO1.1
 INFO2.1
 INFO2.2

To me, working with regular expressions through QRegularExpression is less painful than through std::regex, but both are general-purpose and robust, so they require more fine-tuned result handling. I always use a wrapper I wrote around QRegularExpression to quickly build the kinds of regexes and results that I typically want.
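For comparison, here is a rough std::regex sketch of the same extraction. It is only an illustration: the function name extractBarContentsStd is hypothetical, and because the default ECMAScript grammar of std::regex has no "dot matches everything" option, the character class [\s\S] is swapped in for . so that the content may span line breaks.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: collects capture group 1 of every <bar x>...</bar> match.
std::vector<std::string> extractBarContentsStd(const std::string &input)
{
    // [\s\S]*? lazily matches any character, including newlines,
    // up to the first following </bar>.
    static const std::regex re(R"(<bar [es]>([\s\S]*?)</bar>)");

    std::vector<std::string> results;
    for (std::sregex_iterator it(input.begin(), input.end(), re), end; it != end; ++it)
        results.push_back((*it)[1].str());
    return results;
}
```

Unlike the Qt loop, the iteration here is driven by std::sregex_iterator, and the captured text is returned untrimmed, so a line break inside an element stays in the result.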

Upvotes: 2

Salvatore Avanzo

Reputation: 2786

Maybe you can try this:

QRegularExpression reA("(<bar [se]>[^<]+</bar>)");

QRegularExpressionMatchIterator i = reA.globalMatch(input);
while (i.hasNext()) {
    QRegularExpressionMatch match = i.next();
    if (match.hasMatch()) {
         qDebug() << match.captured(0);
    }
}

that gives me this output

"<bar s>INFO1.1</bar>" 
"<bar e>INFO1.2
</bar>" 
"<bar s>INFO2.1</bar>" 
"<bar e>INFO2.2</bar>"  

while this expression

QRegularExpression reA("((?<=<bar [se]>)((?!</bar>).)+(?=</bar>))",
                       QRegularExpression::DotMatchesEverythingOption);

with this input

<foo><bar s>INFO1</lol>.1</bar> </ qux> <peter></peter><bar e>INFO1.2
</bar><fred></ senseless></fred></ xx><lol></lol></foo><bar s>INFO2.1</bar>
</ nothing><endlessSenselessTags></endlessSenselessTags><rofl>
<bar e>INFO2.2</bar></rofl>

gives me as output

"INFO1</lol>.1" 
"INFO1.2
" 
"INFO2.1" 
"INFO2.2"

Upvotes: 14
