Honeycrisp

Reputation: 53

match string length; find and Replace; remove HTML tags in C++

Here is my question.
I'm trying to remove HTML tags, apply some simple formatting (such as a new line for every piece of information), and later write another program to extract the useful information.
So far I have read the data into a large string array.
My idea is to find every tag enclosed in <> and replace it with a blank. After searching online, I found some programs that can do the find-and-replace part.
Because HTML tags differ in length, I'm wondering whether there is a way in C++ to detect a tag of any length, as long as it is enclosed in <>, and then replace it.
Here is one sample line of HTML:

<div style="width: 200px;"><strong>Balance Sheets (USD $)<br></strong></div>

and this is what I hope to get after processing:

Balance Sheets (USD $)


Could anyone help? Or does anyone have a better idea for handling this task? Any help is greatly appreciated!

<DOCUMENT>
<TYPE>XML
<SEQUENCE>33
<FILENAME>R2.htm
<DESCRIPTION>IDEA: XBRL DOCUMENT
<TEXT>
<html>
  <head>
    <META http-equiv="Content-Type" content="text/html; charset=us-ascii">
    <link rel="StyleSheet" type="text/css" href="report.css"><script type="text/javascript" src="Show.js">/* Do Not Remove This Comment */</script></head>
  <body><span style="display: none;">v2.4.0.6</span><table class="report" border="0" cellspacing="2" id="ID0EQAAG">
      <tr>
        <th class="tl" colspan="1" rowspan="1">
          <div style="width: 200px;"><strong>Balance Sheets (USD $)<br></strong></div>
        </th>
        <th class="th">
          <div>Dec. 31, 2012</div>
        </th>
        <th class="th">
          <div>Dec. 31, 2011</div>
        </th>
      </tr>
      <tr class="re">
        <td class="pl" style="border-bottom: 0px;" valign="top"><a class="a" href="javascript:void(0);" onclick="top.Show.showAR( this, 'defref_tpoi_CurrentAssetsAbstract', window );"><strong>Current assets:</strong></a></td>
        <td class="text">&#xA0;<span></span></td>
        <td class="text">&#xA0;<span></span></td>
      </tr>
      <tr class="ro">
        <td class="pl" style="border-bottom: 0px;" valign="top"><a class="a" href="javascript:void(0);" onclick="top.Show.showAR( this, 'defref_us-gaap_CashAndCashEquivalentsAtCarryingValue', window );">Cash and cash equivalents</a></td>
        <td class="nump">$ 106,999<span></span></td>
        <td class="nump">$ 52,109<span></span></td>
      </tr>
      <tr class="re">
        <td class="pl" style="border-bottom: 0px;" valign="top"><a class="a" href="javascript:void(0);" onclick="top.Show.showAR( this, 'defref_us-gaap_AccountsReceivableNetCurrent', window );">Accounts receivable</a></td>
        <td class="nump">110,720<span></span></td>
        <td class="nump">61,218<span></span></td>
      </tr>
      <tr class="ro">
        <td class="pl" style="border-bottom: 0px;" valign="top"><a class="a" href="javascript:void(0);" onclick="top.Show.showAR( this, 'defref_tpoi_AccountsReceivableRelatedParty', window );">Accounts receivable - related party</a></td>
        <td class="nump">1,527<span></span></td>
        <td class="text">&#xA0;<span></span></td>
      </tr>
      <tr class="reu">
        <td class="pl" style="border-bottom: 0px;" valign="top"><a class="a" href="javascript:void(0);" onclick="top.Show.showAR( this, 'defref_us-gaap_AssetsCurrent', window );">Total current assets</a></td>
        <td class="nump">219,246<span></span></td>
        <td class="nump">113,327<span></span></td>
      </tr>
      <tr class="ro">
        <td class="pl" style="border-bottom: 0px;" valign="top"><a class="a" href="javascript:void(0);" onclick="top.Show.showAR( this, 'defref_us-gaap_PropertyPlantAndEquipmentNet', window );">Property and equipment, net</a></td>
        <td class="nump">152,724<span></span></td>
        <td class="nump">160,454<span></span></td>
      </tr>
      <tr class="re">
        <td class="pl" style="border-bottom: 0px;" valign="top"><a class="a" href="javascript:void(0);" onclick="top.Show.showAR( this, 'defref_us-gaap_CapitalizedSoftwareDevelopmentCostsForSoftwareSoldToCustomers', window );">Capitalized software development costs</a></td>
        <td class="nump">188,371<span></span></td>
        <td class="text">&#xA0;<span></span></td>
      </tr>
      <tr class="ro">
        <td class="pl" style="border-bottom: 0px;" valign="top"><a class="a" href="javascript:void(0);" onclick="top.Show.showAR( this, 'defref_us-gaap_IntangibleAssetsNetExcludingGoodwill', window );">Intangible assets, net</a></td>
        <td class="nump">59,151<span></span></td>
        <td class="nump">59,151<span></span></td>
      </tr>
      <tr class="re">
        <td class="pl" style="border-bottom: 0px;" valign="top"><a class="a" href="javascript:void(0);" onclick="top.Show.showAR( this, 'defref_us-gaap_OtherAssetsCurrent', window );">Other assets</a></td>
        <td class="nump">11,622<span></span></td>
        <td class="nump">15,470<span></span></td>
      </tr>
      <tr class="rou">
        <td class="pl" style="border-bottom: 0px;" valign="top"><a class="a" href="javascript:void(0);" onclick="top.Show.showAR( this, 'defref_us-gaap_NoncurrentAssets', window );">Total long term assets</a></td>
        <td class="nump">411,868<span></span></td>
        <td class="nump">235,075<span></span></td>
      </tr>
      <tr class="reu">
        <td class="pl" style="border-bottom: 0px;" valign="top"><a class="a" href="javascript:void(0);" onclick="top.Show.showAR( this, 'defref_us-gaap_Assets', window );">Total assets</a></td>
        <td class="nump">631,114<span></span></td>
        <td class="nump">348,402<span></span></td>
      </tr>
      <tr class="ro">
        <td class="pl" style="border-bottom: 0px;" valign="top"><a class="a" href="javascript:void(0);" onclick="top.Show.showAR( this, 'defref_tpoi_CurrentLiabilitiesAbstract', window );"><strong>Current liabilities:</strong></a></td>
        <td class="text">&#xA0;<span></span></td>
        <td class="text">&#xA0;<span></span></td>
      </tr>
      <tr class="re">
        <td class="pl" style="border-bottom: 0px;" valign="top"><a class="a" href="javascript:void(0);" onclick="top.Show.showAR( this, 'defref_us-gaap_AccountsPayableCurrent', window );">Accounts payable</a></td>
        <td class="nump">50,866<span></span></td>
        <td class="nump">70,253<span></span></td>
      </tr>
      <tr class="ro">
        <td class="pl" style="border-bottom: 0px;" valign="top"><a class="a" href="javascript:void(0);" onclick="top.Show.showAR( this, 'defref_us-gaap_AccruedLiabilitiesCurrent', window );">Accrued liabilities</a></td>
        <td class="nump">1,452<span></span></td>
        <td class="nump">6,752<span></span></td>
      </tr>
      <tr class="reu">
        <td class="pl" style="border-bottom: 0px;" valign="top"><a class="a" href="javascript:void(0);" onclick="top.Show.showAR( this, 'defref_us-gaap_Liabilities', window );">Total current liabilities</a></td>
        <td class="nump">52,318<span></span></td>
        <td class="nump">77,005<span></span></td>
      </tr>
      <tr class="ro">
        <td class="pl" style="border-bottom: 0px;" valign="top"><a class="a" href="javascript:void(0);" onclick="top.Show.showAR( this, 'defref_us-gaap_OtherAccruedLiabilitiesNoncurrent', window );">Other non current liabilities, accrued interest</a></td>
        <td class="nump">7,500<span></span></td>
        <td class="nump">1,500<span></span></td>
      </tr>
      <tr class="re">
        <td class="pl" style="border-bottom: 0px;" valign="top"><a class="a" href="javascript:void(0);" onclick="top.Show.showAR( this, 'defref_us-gaap_NotesPayableCurrent', window );">Notes payable</a></td>
        <td class="nump">50,000<span></span></td>
        <td class="nump">50,000<span></span></td>
      </tr>
      <tr class="ro">
        <td class="pl" style="border-bottom: 0px;" valign="top"><a class="a" href="javascript:void(0);" onclick="top.Show.showAR( this, 'defref_us-gaap_DueToRelatedPartiesCurrent', window );">Notes payable - related party</a></td>
        <td class="nump">100,000<span></span></td>
        <td class="nump">100,000<span></span></td>
      </tr>
      <tr class="reu">
        <td class="pl" style="border-bottom: 0px;" valign="top"><a class="a" href="javascript:void(0);" onclick="top.Show.showAR( this, 'defref_us-gaap_OtherLiabilitiesCurrent', window );">Total long-term liabilities</a></td>
        <td class="nump">157,500<span></span></td>
        <td class="nump">151,500<span></span></td>
      </tr>
      <tr class="rou">
        <td class="pl" style="border-bottom: 0px;" valign="top"><a class="a" href="javascript:void(0);" onclick="top.Show.showAR( this, 'defref_us-gaap_LiabilitiesCurrent', window );">Total liabilities</a></td>
        <td class="nump">209,818<span></span></td>
        <td class="nump">228,505<span></span></td>
      </tr>
      <tr class="re">
        <td class="pl" style="border-bottom: 0px;" valign="top"><a class="a" href="javascript:void(0);" onclick="top.Show.showAR( this, 'defref_tpoi_ShareholdersEquityAbstract', window );"><strong>Shareholders&#x2019; equity:</strong></a></td>
        <td class="text">&#xA0;<span></span></td>
        <td class="text">&#xA0;<span></span></td>
      </tr>
      <tr class="ro">
        <td class="pl" style="border-bottom: 0px;" valign="top"><a class="a" href="javascript:void(0);" onclick="top.Show.showAR( this, 'defref_us-gaap_CommonStockValue', window );">Common stock, $0 par value, 30,000,000 shares authorized, 13,312,302 and 9,312,302 shares issued and outstanding at December 31, 2012 and 2011, respectively</a></td>
        <td class="nump">1,542,651<span></span></td>
        <td class="nump">946,151<span></span></td>
      </tr>
      <tr class="re">
        <td class="pl" style="border-bottom: 0px;" valign="top"><a class="a" href="javascript:void(0);" onclick="top.Show.showAR( this, 'defref_us-gaap_AccumulatedDepreciationDepletionAndAmortizationPropertyPlantAndEquipment', window );">Accumulated deficit</a></td>
        <td class="num">(1,145,758)<span></span></td>
        <td class="num">(837,310)<span></span></td>
      </tr>

**Many many HTML code here...... Many Many Many**
                    <td><strong> Name:</strong></td>
                    <td><nobr>us-gaap_StockholdersEquity</nobr></td>
                  </tr>
                  <tr>
                    <td style="padding-right: 4px;"><nobr><strong> Namespace Prefix:</strong></nobr></td>
                    <td>us-gaap_</td>
                  </tr>
                  <tr>
                    <td><strong> Data Type:</strong></td>
                    <td>xbrli:monetaryItemType</td>
                  </tr>
                  <tr>
                    <td><strong> Balance Type:</strong></td>
                    <td>credit</td>
                  </tr>
                  <tr>
                    <td><strong> Period Type:</strong></td>
                    <td>instant</td>
                  </tr>
                </table>
              </div>
            </div>
          </td>
        </tr>
      </table>
    </div>
  </body>
</html>
</TEXT>
</DOCUMENT>

The code looks like this, but it is much longer.

Upvotes: 1

Views: 5477

Answers (3)

Captain Obvlious

Reputation: 20073

Since it isn't clear exactly what you are trying to accomplish, the solution below breaks an HTML document into individual tags and lines of text. There are probably a few corner cases that are not handled, but it does handle attribute strings in case they contain the end-tag delimiter. It was written quickly and has not been tested much, so I will leave any necessary fixes up to you. It's not pretty, but it works and should be enough to get you started.

#include <vector>
#include <string>
#include <iostream>


int main()
{
    std::string html("<div style=\"width: 200px;\"><strong>Balance Sheets (USD $)<br></strong></div>");
    std::vector<std::string>    tags;
    std::vector<std::string>    text;

    for(;;)
    {
        std::string::size_type  startpos;

        startpos = html.find('<');
        if(startpos == std::string::npos)
        {
            // no tags left, only text (skip it if nothing remains)
            if(!html.empty())
            {
                text.push_back(html);
            }
            break;
        }

        // handle the text before the tag    
        if(0 != startpos)
        {
            text.push_back(html.substr(0, startpos));
            html = html.substr(startpos, html.size() - startpos);
            startpos = 0;
        }

        //  skip all the text in the html tag
        std::string::size_type endpos;
        for(endpos = startpos;
            endpos < html.size() && html[endpos] != '>';
            ++endpos)
        {
            // since '>' can appear inside of an attribute string we need
            // to make sure we process it properly.
            if(html[endpos] == '"')
            {
                endpos++;
                while(endpos < html.size() && html[endpos] != '"')
                {
                    endpos++;
                }
            }
        }

        //  Handle text and end of html that has beginning of tag but not the end.
        //  Use >= because an unterminated attribute string can leave endpos
        //  one past html.size(), which would make substr() throw below.
        if(endpos >= html.size())
        {
            html.clear();
            break;
        }
        else
        {
            //  handle the entire tag
            endpos++;
            tags.push_back(html.substr(startpos, endpos - startpos));
            html = html.substr(endpos, html.size() - endpos);
        }
    }

    std::cout << "tags:\n-----------------" << std::endl;

    // auto, iterators or range based for loop would probably be better but
    // this makes it a bit easier to read.    
    for(size_t i = 0; i < tags.size(); i++)
    {
        std::cout << tags[i] << std::endl;
    }

    std::cout << "\ntext:\n-----------------" << std::endl;
    for(size_t i = 0; i < text.size(); i++)
    {
        std::cout << text[i] << std::endl;
    }
}

The above code generates the following output (without the space after <, which is only there because the SO markdown would otherwise interpret each line as an HTML tag, like it should).

tags:
-----------------
< div style="width: 200px;">
< strong>
< br>
< /strong>
< /div>

text:
-----------------
Balance Sheets (USD $)
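
If all you need is the text with the tags removed, the same scanning idea can be folded into one function. This is only a minimal sketch: `skip_tag` and `strip_tags` are names I made up for illustration, it assumes attribute values use double quotes, and it does not decode entities such as `&#xA0;` or treat comments and scripts specially.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: given the index of a '<', returns the index just past
// the matching '>' (or html.size() if the tag is never closed).
std::string::size_type skip_tag(const std::string& html, std::string::size_type pos)
{
    assert(pos < html.size() && html[pos] == '<');
    bool in_quote = false;  // true while inside a "..." attribute value
    for (++pos; pos < html.size(); ++pos)
    {
        if (html[pos] == '"')
            in_quote = !in_quote;
        else if (html[pos] == '>' && !in_quote)
            return pos + 1;
    }
    return html.size();
}

// Hypothetical helper: returns a copy of html with every <...> tag removed.
std::string strip_tags(const std::string& html)
{
    std::string out;
    std::string::size_type pos = 0;
    while (pos < html.size())
    {
        if (html[pos] == '<')
            pos = skip_tag(html, pos);  // jump over the whole tag
        else
            out += html[pos++];         // keep ordinary text
    }
    return out;
}
```

Calling `strip_tags` on the sample line from the question returns `Balance Sheets (USD $)`.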

Upvotes: 3

learner

Reputation: 1982

One suggestion is to collect all the HTML tags you care about in a hash map, then read each tag name (the text after "<"), look it up in the hash map, and have your code act accordingly.
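
A rough sketch of that idea, assuming the "action" for a tag is just the text to emit in its place: tags that end a piece of information become a newline, and any tag not in the map is dropped. The function names and the choice of tags in the map are illustrative only, and `>` inside quoted attribute values is not handled here.

```cpp
#include <cassert>
#include <cctype>
#include <string>
#include <unordered_map>

// Hypothetical helper: reads the tag name that follows the '<' at index pos,
// lower-cased and including a leading '/' for closing tags ("div", "/td", ...).
std::string read_tag_name(const std::string& html, std::string::size_type pos)
{
    assert(pos < html.size() && html[pos] == '<');
    std::string name;
    for (++pos; pos < html.size(); ++pos)
    {
        unsigned char c = static_cast<unsigned char>(html[pos]);
        if (!(std::isalnum(c) || (c == '/' && name.empty())))
            break;
        name += static_cast<char>(std::tolower(c));
    }
    return name;
}

// Replaces each tag with the text registered for it in the hash map;
// tags that are not in the map are simply dropped.
std::string replace_tags(const std::string& html)
{
    // Illustrative choice: these tags end a piece of information.
    static const std::unordered_map<std::string, std::string> actions = {
        {"br", "\n"}, {"/div", "\n"}, {"/td", "\n"}, {"/th", "\n"}, {"/tr", "\n"}
    };

    std::string out;
    std::string::size_type pos = 0;
    while (pos < html.size())
    {
        if (html[pos] != '<')
        {
            out += html[pos++];  // ordinary text is copied through
            continue;
        }
        auto it = actions.find(read_tag_name(html, pos));
        if (it != actions.end())
            out += it->second;   // act on the tag: emit its replacement
        // Skip to the end of the tag (or the end of input if it is unclosed).
        std::string::size_type end = html.find('>', pos);
        pos = (end == std::string::npos) ? html.size() : end + 1;
    }
    return out;
}
```

Keeping the per-tag behaviour in the map means a new rule (for example, a tab for a cell boundary) is a one-line change to the table rather than another branch in the scanning loop.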

Upvotes: 0

Ankit Jain

Reputation: 148

Regular expressions are not a good way to deal with HTML. Try using a parsing library instead. Examples are Expat or Xerces (both are XML parsers, so they expect well-formed markup).

These libraries let you read the text content of an entire HTML document or of individual tags, along with their attributes, etc.

Upvotes: 1
