Reputation: 53
Here is my question.
I'm trying to remove the HTML tags from a document, format what remains simply (for example, one line per piece of information), and later write another program to extract the useful information.
I have read the data into a large string array so far.
Here is my thought: find every tag enclosed in <> and replace it with a blank. After checking online, I saw some programs that can do the find-and-replace part.
Because different HTML tags have different lengths, I'm wondering whether there is a way in C++ to detect a tag of any length, as long as it is enclosed in <>, and then just replace it.
Here is one sample line of the HTML:
<div style="width: 200px;"><strong>Balance Sheets (USD $)<br></strong></div>
and this is what I hope to get after processing:
Balance Sheets (USD $)
Could anyone help? Or does anyone have a better idea for dealing with this task? Any help is greatly appreciated!
<DOCUMENT>
<TYPE>XML
<SEQUENCE>33
<FILENAME>R2.htm
<DESCRIPTION>IDEA: XBRL DOCUMENT
<TEXT>
<html>
<head>
<META http-equiv="Content-Type" content="text/html; charset=us-ascii">
<link rel="StyleSheet" type="text/css" href="report.css"><script type="text/javascript" src="Show.js">/* Do Not Remove This Comment */</script></head>
<body><span style="display: none;">v2.4.0.6</span><table class="report" border="0" cellspacing="2" id="ID0EQAAG">
<tr>
<th class="tl" colspan="1" rowspan="1">
<div style="width: 200px;"><strong>Balance Sheets (USD $)<br></strong></div>
</th>
<th class="th">
<div>Dec. 31, 2012</div>
</th>
<th class="th">
<div>Dec. 31, 2011</div>
</th>
</tr>
<tr class="re">
<td class="pl" style="border-bottom: 0px;" valign="top"><a class="a" href="javascript:void(0);" onclick="top.Show.showAR( this, 'defref_tpoi_CurrentAssetsAbstract', window );"><strong>Current assets:</strong></a></td>
<td class="text"> <span></span></td>
<td class="text"> <span></span></td>
</tr>
<tr class="ro">
<td class="pl" style="border-bottom: 0px;" valign="top"><a class="a" href="javascript:void(0);" onclick="top.Show.showAR( this, 'defref_us-gaap_CashAndCashEquivalentsAtCarryingValue', window );">Cash and cash equivalents</a></td>
<td class="nump">$ 106,999<span></span></td>
<td class="nump">$ 52,109<span></span></td>
</tr>
<tr class="re">
<td class="pl" style="border-bottom: 0px;" valign="top"><a class="a" href="javascript:void(0);" onclick="top.Show.showAR( this, 'defref_us-gaap_AccountsReceivableNetCurrent', window );">Accounts receivable</a></td>
<td class="nump">110,720<span></span></td>
<td class="nump">61,218<span></span></td>
</tr>
<tr class="ro">
<td class="pl" style="border-bottom: 0px;" valign="top"><a class="a" href="javascript:void(0);" onclick="top.Show.showAR( this, 'defref_tpoi_AccountsReceivableRelatedParty', window );">Accounts receivable - related party</a></td>
<td class="nump">1,527<span></span></td>
<td class="text"> <span></span></td>
</tr>
<tr class="reu">
<td class="pl" style="border-bottom: 0px;" valign="top"><a class="a" href="javascript:void(0);" onclick="top.Show.showAR( this, 'defref_us-gaap_AssetsCurrent', window );">Total current assets</a></td>
<td class="nump">219,246<span></span></td>
<td class="nump">113,327<span></span></td>
</tr>
<tr class="ro">
<td class="pl" style="border-bottom: 0px;" valign="top"><a class="a" href="javascript:void(0);" onclick="top.Show.showAR( this, 'defref_us-gaap_PropertyPlantAndEquipmentNet', window );">Property and equipment, net</a></td>
<td class="nump">152,724<span></span></td>
<td class="nump">160,454<span></span></td>
</tr>
<tr class="re">
<td class="pl" style="border-bottom: 0px;" valign="top"><a class="a" href="javascript:void(0);" onclick="top.Show.showAR( this, 'defref_us-gaap_CapitalizedSoftwareDevelopmentCostsForSoftwareSoldToCustomers', window );">Capitalized software development costs</a></td>
<td class="nump">188,371<span></span></td>
<td class="text"> <span></span></td>
</tr>
<tr class="ro">
<td class="pl" style="border-bottom: 0px;" valign="top"><a class="a" href="javascript:void(0);" onclick="top.Show.showAR( this, 'defref_us-gaap_IntangibleAssetsNetExcludingGoodwill', window );">Intangible assets, net</a></td>
<td class="nump">59,151<span></span></td>
<td class="nump">59,151<span></span></td>
</tr>
<tr class="re">
<td class="pl" style="border-bottom: 0px;" valign="top"><a class="a" href="javascript:void(0);" onclick="top.Show.showAR( this, 'defref_us-gaap_OtherAssetsCurrent', window );">Other assets</a></td>
<td class="nump">11,622<span></span></td>
<td class="nump">15,470<span></span></td>
</tr>
<tr class="rou">
<td class="pl" style="border-bottom: 0px;" valign="top"><a class="a" href="javascript:void(0);" onclick="top.Show.showAR( this, 'defref_us-gaap_NoncurrentAssets', window );">Total long term assets</a></td>
<td class="nump">411,868<span></span></td>
<td class="nump">235,075<span></span></td>
</tr>
<tr class="reu">
<td class="pl" style="border-bottom: 0px;" valign="top"><a class="a" href="javascript:void(0);" onclick="top.Show.showAR( this, 'defref_us-gaap_Assets', window );">Total assets</a></td>
<td class="nump">631,114<span></span></td>
<td class="nump">348,402<span></span></td>
</tr>
<tr class="ro">
<td class="pl" style="border-bottom: 0px;" valign="top"><a class="a" href="javascript:void(0);" onclick="top.Show.showAR( this, 'defref_tpoi_CurrentLiabilitiesAbstract', window );"><strong>Current liabilities:</strong></a></td>
<td class="text"> <span></span></td>
<td class="text"> <span></span></td>
</tr>
<tr class="re">
<td class="pl" style="border-bottom: 0px;" valign="top"><a class="a" href="javascript:void(0);" onclick="top.Show.showAR( this, 'defref_us-gaap_AccountsPayableCurrent', window );">Accounts payable</a></td>
<td class="nump">50,866<span></span></td>
<td class="nump">70,253<span></span></td>
</tr>
<tr class="ro">
<td class="pl" style="border-bottom: 0px;" valign="top"><a class="a" href="javascript:void(0);" onclick="top.Show.showAR( this, 'defref_us-gaap_AccruedLiabilitiesCurrent', window );">Accrued liabilities</a></td>
<td class="nump">1,452<span></span></td>
<td class="nump">6,752<span></span></td>
</tr>
<tr class="reu">
<td class="pl" style="border-bottom: 0px;" valign="top"><a class="a" href="javascript:void(0);" onclick="top.Show.showAR( this, 'defref_us-gaap_Liabilities', window );">Total current liabilities</a></td>
<td class="nump">52,318<span></span></td>
<td class="nump">77,005<span></span></td>
</tr>
<tr class="ro">
<td class="pl" style="border-bottom: 0px;" valign="top"><a class="a" href="javascript:void(0);" onclick="top.Show.showAR( this, 'defref_us-gaap_OtherAccruedLiabilitiesNoncurrent', window );">Other non current liabilities, accrued interest</a></td>
<td class="nump">7,500<span></span></td>
<td class="nump">1,500<span></span></td>
</tr>
<tr class="re">
<td class="pl" style="border-bottom: 0px;" valign="top"><a class="a" href="javascript:void(0);" onclick="top.Show.showAR( this, 'defref_us-gaap_NotesPayableCurrent', window );">Notes payable</a></td>
<td class="nump">50,000<span></span></td>
<td class="nump">50,000<span></span></td>
</tr>
<tr class="ro">
<td class="pl" style="border-bottom: 0px;" valign="top"><a class="a" href="javascript:void(0);" onclick="top.Show.showAR( this, 'defref_us-gaap_DueToRelatedPartiesCurrent', window );">Notes payable - related party</a></td>
<td class="nump">100,000<span></span></td>
<td class="nump">100,000<span></span></td>
</tr>
<tr class="reu">
<td class="pl" style="border-bottom: 0px;" valign="top"><a class="a" href="javascript:void(0);" onclick="top.Show.showAR( this, 'defref_us-gaap_OtherLiabilitiesCurrent', window );">Total long-term liabilities</a></td>
<td class="nump">157,500<span></span></td>
<td class="nump">151,500<span></span></td>
</tr>
<tr class="rou">
<td class="pl" style="border-bottom: 0px;" valign="top"><a class="a" href="javascript:void(0);" onclick="top.Show.showAR( this, 'defref_us-gaap_LiabilitiesCurrent', window );">Total liabilities</a></td>
<td class="nump">209,818<span></span></td>
<td class="nump">228,505<span></span></td>
</tr>
<tr class="re">
<td class="pl" style="border-bottom: 0px;" valign="top"><a class="a" href="javascript:void(0);" onclick="top.Show.showAR( this, 'defref_tpoi_ShareholdersEquityAbstract', window );"><strong>Shareholders’ equity:</strong></a></td>
<td class="text"> <span></span></td>
<td class="text"> <span></span></td>
</tr>
<tr class="ro">
<td class="pl" style="border-bottom: 0px;" valign="top"><a class="a" href="javascript:void(0);" onclick="top.Show.showAR( this, 'defref_us-gaap_CommonStockValue', window );">Common stock, $0 par value, 30,000,000 shares authorized, 13,312,302 and 9,312,302 shares issued and outstanding at December 31, 2012 and 2011, respectively</a></td>
<td class="nump">1,542,651<span></span></td>
<td class="nump">946,151<span></span></td>
</tr>
<tr class="re">
<td class="pl" style="border-bottom: 0px;" valign="top"><a class="a" href="javascript:void(0);" onclick="top.Show.showAR( this, 'defref_us-gaap_AccumulatedDepreciationDepletionAndAmortizationPropertyPlantAndEquipment', window );">Accumulated deficit</a></td>
<td class="num">(1,145,758)<span></span></td>
<td class="num">(837,310)<span></span></td>
</tr>
**Many many HTML code here...... Many Many Many**
<td><strong> Name:</strong></td>
<td><nobr>us-gaap_StockholdersEquity</nobr></td>
</tr>
<tr>
<td style="padding-right: 4px;"><nobr><strong> Namespace Prefix:</strong></nobr></td>
<td>us-gaap_</td>
</tr>
<tr>
<td><strong> Data Type:</strong></td>
<td>xbrli:monetaryItemType</td>
</tr>
<tr>
<td><strong> Balance Type:</strong></td>
<td>credit</td>
</tr>
<tr>
<td><strong> Period Type:</strong></td>
<td>instant</td>
</tr>
</table>
</div>
</div>
</td>
</tr>
</table>
</div>
</body>
</html>
</TEXT>
</DOCUMENT>
The real file looks like this, but is much longer.
Upvotes: 1
Views: 5477
Reputation: 20073
Since it isn't clear exactly what you are trying to accomplish, the solution below breaks an HTML document into individual tags and runs of text. There are probably a few corner cases that are not handled, but it does cope with double-quoted attribute strings that contain the end-of-tag delimiter >. It was written quickly and has not been tested much, so I will leave any necessary fixes up to you. It's not pretty, but it works and should be enough to get you started.
#include <vector>
#include <string>
#include <iostream>
int main()
{
    std::string html("<div style=\"width: 200px;\"><strong>Balance Sheets (USD $)<br></strong></div>");
    std::vector<std::string> tags;
    std::vector<std::string> text;

    for(;;)
    {
        std::string::size_type startpos = html.find('<');
        if(startpos == std::string::npos)
        {
            // no tags left, only (possibly empty) trailing text
            if(!html.empty())
            {
                text.push_back(html);
            }
            break;
        }

        // handle the text before the tag
        if(0 != startpos)
        {
            text.push_back(html.substr(0, startpos));
            html = html.substr(startpos);
            startpos = 0;
        }

        // skip all the text in the html tag
        std::string::size_type endpos;
        for(endpos = startpos;
            endpos < html.size() && html[endpos] != '>';
            ++endpos)
        {
            // since '>' can appear inside of an attribute string we need
            // to make sure we process it properly.
            if(html[endpos] == '"')
            {
                endpos++;
                while(endpos < html.size() && html[endpos] != '"')
                {
                    endpos++;
                }
            }
        }

        // The input ends with the beginning of a tag but not its end: drop it.
        // endpos can overshoot html.size() by one when a quoted attribute is
        // never closed, hence >= rather than ==.
        if(endpos >= html.size())
        {
            break;
        }

        // handle the entire tag
        endpos++;
        tags.push_back(html.substr(startpos, endpos - startpos));
        html = html.substr(endpos);
    }

    std::cout << "tags:\n-----------------" << std::endl;

    // auto, iterators or range based for loop would probably be better but
    // this makes it a bit easier to read.
    for(size_t i = 0; i < tags.size(); i++)
    {
        std::cout << tags[i] << std::endl;
    }

    std::cout << "\ntext:\n-----------------" << std::endl;
    for(size_t i = 0; i < text.size(); i++)
    {
        std::cout << text[i] << std::endl;
    }
}
The above code generates the following output. The real output has no space after each <; the space is only there because the SO markdown would otherwise interpret the line as an HTML tag, as it should.
tags:
-----------------
< div style="width: 200px;">
< strong>
< br>
< /strong>
< /div>

text:
-----------------
Balance Sheets (USD $)
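If only the text is needed, one line per piece of information as the question asks, the same idea can be reduced to a single pass over the input. The following is a minimal sketch, not a tested solution: `extractText` is a hypothetical helper name, and it assumes attribute values are double-quoted, as they are in the sample file.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: returns the runs of text found between tags,
// one entry per run, skipping runs that contain only whitespace.
std::vector<std::string> extractText(const std::string& html)
{
    std::vector<std::string> lines;
    std::string current;
    bool inTag = false;    // true while between '<' and the matching '>'
    bool inQuote = false;  // true while inside a double-quoted attribute value

    // Store the current run if it holds any non-whitespace character.
    auto flush = [&]() {
        if (current.find_first_not_of(" \t\r\n") != std::string::npos)
            lines.push_back(current);
        current.clear();
    };

    for (char c : html)
    {
        if (inTag)
        {
            if (c == '"')
                inQuote = !inQuote;
            else if (c == '>' && !inQuote)
                inTag = false;          // a '>' outside quotes ends the tag
        }
        else if (c == '<')
        {
            flush();                    // text before the tag is complete
            inTag = true;
            inQuote = false;
        }
        else
        {
            current += c;
        }
    }
    flush();                            // trailing text after the last tag
    return lines;
}
```

Calling `extractText` on the sample line from the question gives a single entry, `Balance Sheets (USD $)`. Note that this sketch treats the body of a script element and HTML comments as ordinary text, and it does not decode entities such as `&amp;`.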
Upvotes: 3
Reputation: 1982
One suggestion is to put all the HTML tag names into a hash map. Then, each time you read a tag (the name that follows "<"), look it up in the hash map and have your code act accordingly.
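As a rough sketch of that idea, the lookup could be done with `std::unordered_map`. The `TagAction` values, the function names, and the choice of which tags map to which action are all assumptions made for illustration; they are not taken from the question.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <unordered_map>

// Hypothetical actions a formatter might take when it meets a tag.
enum class TagAction { Ignore, NewLine, Separator };

// Extracts the lower-cased name from a tag such as "<TD class=...>" or "</tr>".
std::string tagName(const std::string& tag)
{
    std::string name;
    std::size_t i = 1;                           // skip the leading '<'
    if (i < tag.size() && tag[i] == '/')
        ++i;                                     // skip '/' of a closing tag
    while (i < tag.size() && std::isalnum(static_cast<unsigned char>(tag[i])))
    {
        name += static_cast<char>(std::tolower(static_cast<unsigned char>(tag[i])));
        ++i;
    }
    return name;
}

// Looks the tag name up in the hash map; unknown tags are simply ignored.
TagAction actionFor(const std::string& tag)
{
    static const std::unordered_map<std::string, TagAction> actions = {
        {"br",  TagAction::NewLine},
        {"div", TagAction::NewLine},
        {"tr",  TagAction::NewLine},
        {"td",  TagAction::Separator},
        {"th",  TagAction::Separator},
    };
    const auto it = actions.find(tagName(tag));
    return it == actions.end() ? TagAction::Ignore : it->second;
}
```

With this table, `actionFor("<td class=\"nump\">")` yields `TagAction::Separator`, while a tag that is not listed, such as `<strong>`, yields `TagAction::Ignore`.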
Upvotes: 0
Reputation: 148
Regular expressions are not the way to go with HTML. Try using a parsing library instead. Examples are Expat or Xerces (both are XML parsers, so the markup has to be well-formed for them to accept it).
These libraries let you read the document as a whole or walk through its individual tags, their attributes, and so on.
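To illustrate why a plain regular expression is fragile here, consider the obvious pattern `<[^>]*>` applied with `std::regex_replace`. This is only a small demonstration under that one assumed pattern, and `stripWithRegex` is a hypothetical name.

```cpp
#include <regex>
#include <string>

// Naive approach: delete everything that looks like <...>.
std::string stripWithRegex(const std::string& html)
{
    static const std::regex tag("<[^>]*>");
    return std::regex_replace(html, tag, "");
}
```

For simple input such as `<strong>Current assets:</strong>` this returns `Current assets:`. As soon as an attribute value contains a `>`, as in `<a title="a>b">x</a>`, the pattern ends the tag too early and the result is `b">x` instead of `x`. A real parser keeps track of quoting, comments, and script content, which a single pattern like this cannot do.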
Upvotes: 1