user9506309

Removing HTML tags from a string of text

As a practice assignment, my professor challenged the class to write some code that removes HTML tags from a string of text. He mentioned a specific command that we would learn later on that would do this for us, but he wants us to do it manually.

Here's what I have so far:

#include<iostream>
#include<string>
using namespace std;

int main() {
  string name = "<HTML> smelly </b> butts </b> smell<test>";
  cout << name << endl;

  int a = 0, b = 0;

  for (int a = b; a < name.length(); a++) {
      if (name[a] == '<') {
          for (int b = a; b < name.length(); b++) {
              if (name[b] == '>') {
                  name.erase(a, (b + 1));
                  break;
              }
          }
      }
  }

  cout << name << endl;

  system("pause");
  return 0;
}

I feel like I'm close, but I'm not getting the correct output.

Upvotes: 2

Views: 4133

Answers (2)

DoomzDay

Reputation: 39

for (int b = a; b < name.length(); b++) {
    if (name[b] == '>') {
        name.erase(a, (b + 1));
        break;
    }
}

In this part of the code you are erasing b + 1 characters starting at index a. The second argument of erase is a length, not an end index, so you should erase b - a + 1 characters instead.

Try this one:

for (int b = a; b < name.length(); b++) {
    if (name[b] == '>') {
        name.erase(a, (b - a + 1));
        break;
    }
}

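It should work as you want for the string in your question. Note that after an erase the outer loop still increments a, so the character that shifts into index a is never examined; directly adjacent tags such as <b><i> would need an a-- right after the erase.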
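
To make the length-versus-index point concrete, here is a minimal sketch of the same manual approach wrapped in a function. The name strip_tags_loop is hypothetical (it is not from the question), and the sketch assumes that an unterminated '<' should simply be left in place. It does not advance the index after an erase, so adjacent tags are handled as well.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (the name is an assumption): removes every "<...>"
// span with a manual scan, in the spirit of the loop from the question.
std::string strip_tags_loop(std::string text) {
    std::size_t a = 0;
    while (a < text.length()) {
        if (text[a] != '<') {
            ++a;  // ordinary character: move on
            continue;
        }
        // Scan forward from the '<' to the matching '>'.
        std::size_t b = a;
        while (b < text.length() && text[b] != '>') {
            ++b;
        }
        if (b == text.length()) {
            break;  // no closing '>' anywhere after a: keep the rest as-is
        }
        // erase(pos, count): count is a length, hence b - a + 1.
        text.erase(a, b - a + 1);
        // Do not advance a: a new character has shifted into this index.
    }
    return text;
}
```

Calling strip_tags_loop("<b><i>x</i></b>") yields "x", which is the adjacent-tag case that a plain a++ after the erase would get wrong.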

Upvotes: 1

Carl

Reputation: 2067

Here is another approach that is arguably cleaner and more readable. It does not deal with nested tags, but you could expand on it to handle them.

#include <string>
#include <iostream>

int main()
{
    std::string html = "<HTML> Something <b> slightly less </b> profane here <test>";

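    // Repeatedly erase the first "<...>" span until no '<' remains.
    auto startpos = html.find('<');
    while (startpos != std::string::npos)
    {
        // Look for the closing bracket only after the opening one.
        auto endpos = html.find('>', startpos);

        if (endpos == std::string::npos)
        {
            break; // unterminated tag: leave the rest untouched
        }

        // Check for npos before adding 1; npos + 1 would wrap around to 0.
        html.erase(startpos, endpos - startpos + 1);
        startpos = html.find('<', startpos);
    }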

    std::cout << html << '\n';

    return 0;
}

For clarity, std::string::npos is returned when the sought-after string does not occur in the string. So, while there are still opening brackets in the document, the loop erases everything from the first opening bracket through the next closing bracket. It cannot tell a literal < (as in 5 < 2) apart from a real tag such as <html>, so there are flaws, but it shows a different approach you can use as a starting point.
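
One way to soften the 5 < 2 problem is sketched below. The helper name strip_tags_heuristic and the rule it applies are assumptions for illustration, not part of the answer above: a '<' is treated as the start of a tag only when it is immediately followed by a letter, '/' or '!'. This is a heuristic and not an HTML parser, so input like x<y and z>w would still be misread as a tag.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper (name and heuristic are assumptions): copies the input
// to the output, skipping "<...>" spans whose '<' is immediately followed by
// a letter, '/' or '!'. Any other '<' is kept as literal text.
std::string strip_tags_heuristic(const std::string& html) {
    std::string out;
    std::size_t i = 0;
    while (i < html.size()) {
        if (html[i] == '<' && i + 1 < html.size()) {
            unsigned char next = static_cast<unsigned char>(html[i + 1]);
            bool looks_like_tag =
                std::isalpha(next) || next == '/' || next == '!';
            if (looks_like_tag) {
                std::size_t close = html.find('>', i + 1);
                if (close != std::string::npos) {
                    i = close + 1;  // skip the whole tag
                    continue;
                }
            }
        }
        out += html[i];  // ordinary character (or a literal '<'): keep it
        ++i;
    }
    return out;
}
```

Building a new string instead of erasing in place also avoids shifting the tail of the string on every removed tag.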

Upvotes: 2
