Alex

Reputation: 39

Qt Regexp extract <p> tags from Html string

I have rich text in a QTextEdit and store its HTML source in a string. I'd like to extract all the lines one by one (there are 4-6 of them). The string looks like this:

//html opening stuff
<p style = attributes...><span style = attributes...>My Text</span></p>
//more lines like this
//html closing stuff

So I need the WHOLE LINES from the opening p tag to the closing p tag (including the p tags too). I checked and tried everything I found around here and on other sites, but still no result.

Here's my code ("htmlStyle" is the input string):

QStringList list;
QRegExp rx("(<p[^>]*>.*?</p>)");
int pos = 0;

while ((pos = rx.indexIn(htmlStyle, pos)) != -1) {
    list << rx.cap(1);
    pos += rx.matchedLength();
}

Or is there any other way to do this without regex?

Upvotes: 0

Views: 929

Answers (3)

Alex

Reputation: 39

For those who need the full Qt solution, I figured it out based on @Aditya Poorna's answer. Thanks for the tip!

Here's the code:

int startIndex = htmlStyle.indexOf("<p");
int endIndex = htmlStyle.indexOf("</p>");

// Stop when either tag is missing, so an unclosed <p> cannot yield a bogus range
while (startIndex >= 0 && endIndex >= 0) {
    endIndex = endIndex + 4; // include the closing </p> in the substring
    QStringRef subString(&htmlStyle, startIndex, endIndex - startIndex);
    qDebug() << subString;
    startIndex = htmlStyle.indexOf("<p", startIndex + 1);
    endIndex = htmlStyle.indexOf("</p>", endIndex + 1);
}

The QStringRef constructor takes the source string, a start position, and a length, so "subString" refers to the "endIndex - startIndex" characters of "htmlStyle" beginning at "startIndex".
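For readers without Qt at hand, here is a minimal sketch of the same index-scanning idea in standard C++, using std::string instead of QString. It adds guards for an unterminated paragraph and for tags such as <pre> that merely start with "<p". The function name extractParagraphs is made up for illustration, and the sketch assumes lowercase tags and paragraphs that are not nested.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: collects every "<p ...>...</p>" span by index scanning.
// Assumes lowercase tag names and no nested paragraphs.
std::vector<std::string> extractParagraphs(const std::string& html) {
    const std::string closeTag = "</p>";
    std::vector<std::string> paragraphs;
    std::size_t start = html.find("<p");
    while (start != std::string::npos) {
        // Accept only "<p>" or "<p " so that tags such as <pre> are skipped.
        const std::size_t after = start + 2;
        const bool isParagraph = after < html.size() &&
            (html[after] == '>' || html[after] == ' ');
        if (isParagraph) {
            const std::size_t end = html.find(closeTag, after);
            if (end == std::string::npos) break;  // unterminated paragraph
            paragraphs.push_back(
                html.substr(start, end + closeTag.size() - start));
            start = html.find("<p", end + closeTag.size());
        } else {
            start = html.find("<p", after);
        }
    }
    return paragraphs;
}
```

Calling extractParagraphs(htmlSource) on a std::string copy of the HTML returns one entry per paragraph, each including its opening and closing p tags.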

Upvotes: 0

Aditya

Reputation: 2415

Below is a plain Java approach; hope this helps:

// Search for "<p" rather than "<p>" so that tags with attributes also match
int startIndex = htmlStyle.indexOf("<p");
int endIndex = htmlStyle.indexOf("</p>");
while (startIndex >= 0 && endIndex >= 0) {
    endIndex = endIndex + 4; // to include </p> in the substring
    System.out.println(htmlStyle.substring(startIndex, endIndex));
    startIndex = htmlStyle.indexOf("<p", startIndex + 1);
    endIndex = htmlStyle.indexOf("</p>", endIndex + 1);
}

Upvotes: 1

HTML/XML is not a regular grammar. You cannot parse it with a regex. See e.g. this question. Parsing HTML is not trivial.
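To illustrate the point, here is a small sketch using std::regex from standard C++ (not QRegExp) on an element type that can nest. The function name firstDivMatch and the sample input are made up for this example.

```cpp
#include <regex>
#include <string>

// Returns the first substring matched by a lazy "<div>...</div>" pattern,
// or an empty string when nothing matches.
std::string firstDivMatch(const std::string& html) {
    static const std::regex pattern("<div>.*?</div>");
    std::smatch match;
    return std::regex_search(html, match, pattern) ? match.str(0)
                                                   : std::string();
}
```

For the input "<div>a<div>b</div>c</div>" the match ends at the first closing tag, giving "<div>a<div>b</div>", which is not a balanced element. A greedy pattern errs the other way and runs to the last closing tag in the string, so neither form tracks nesting.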

You can iterate the paragraphs in a rich text document using QTextDocument, QTextBlock, QTextCursor, etc. All the HTML parsing is taken care of for you. QTextDocument handles exactly the subset of HTML that QTextEdit supports, because QTextEdit uses it as its internal representation. You can get it directly from the widget using QTextEdit::document(). For example:

void iterate(QTextEdit * edit) {
   auto const & doc = *edit->document();
   // QTextBlock::next() returns the following block; it does not advance in place
   for (auto block = doc.begin(); block != doc.end(); block = block.next()) {
      // do something with text block e.g. iterate its fragments
      for (auto fragment = block.begin(); fragment != block.end(); ++fragment) {
         // do something with text fragment
      }
   }
}

Instead of parsing HTML by hand, which is easy to get wrong, explore the structure of the QTextDocument and use it as needed.

Upvotes: 2
