Nyxynyx

Reputation: 63619

Cleaning up HTML

I need to clean up chunks of HTML (containing <table>, <div>, <p>, <span id='something'>) into more standardized HTML that carries none of the original styling and has no <table>, <div> or <span> elements. One way to do this would be to remove all original HTML tags except <ul>, <li>, <ol>, <br> and <strong>, and to replace <td> and <tr> with <p>. The tags that remain should be stripped of classes, ids and all other attributes.

How can I do this? At the moment I am using strip_tags(), which removes all tags but squeezes the remaining text into a single line, making it hard to read.

Example HTML Code to Clean Up

<table cellpadding="0" cellspacing="0" width="100%">
<tr><td align="center">
<table cellpadding="0" cellspacing="0" width="850">
<tr valign="top">
<td width="50%" style="padding-left:30px;">
<img border="0" src="http://www.newpads.info/l/AC-000-078.gif"><br>2000 Massachusetts Ave., Cambridge, MA 02140<br>Phone: (617) 498-0011 - Fax: (617) 498-0044<br><a href="http://www.windsorrealty.net" rel="nofollow">http://www.windsorrealty.net</a></td>
<td width="35%" style="border-left: 1px solid gray; padding-left: 15px">
<div><span style="font-weight:bold;">Sugandha Singh</span></div>
<div style="padding:10px;">
<div style="padding:2px;"><img src="http://www.newpads.info/img/phone.gif"> 781 985 4489</div>
<div style="padding:2px;"><img src="http://www.newpads.info/img/email2.gif"> [email protected]</div>
<div><img src="http://www.newpads.info/img/question.gif"> <a href="http://ag006436.speedhatch.com/rentals/CAM-058-197/inquiry" rel="nofollow"><font size="3">Ask Me A Question</font></a></div><div><img src="http://www.newpads.info/img/magnet.png">  <a href="http://ag006436.speedhatch.com" rel="nofollow"><font size="3">Search My Apartments</font></a></div></div>
</td>
</tr>
</table>
<br><table width="850">
<tr><td colspan="2" height="2" bgcolor="#275c7d"></td></tr><tr><tr><td colspan="2"><div style="font-weight:bold;"><font size="3">HARVARD LAW / SQUARE. HEAT+HOTWATER INCL. JAN 1. 1/2 FEE</font></div></td></tr></tr><tr valign="top"><td><img src="http://maps.google.com/maps/api/staticmap?center=42.38047,-71.121008&amp;path=weight:4|42.37847,-71.118008|42.37847,-71.124008|42.38247,-71.124008|42.38247,-71.118008|42.37847,-71.118008&amp;zoom=15&amp;size=335x225&amp;sensor=false" style="width:275px;"></td><td><font size="2"><table style="width:100%;height:100%;"><tr valign="top"><td width="50%"><table cellpadding="3" style="width:100%;"><tr><td colspan="2" style="font-weight:bold;">Basic Info</td></tr><tr><td style="width:45%;">Referral ID:</td><td>CAM-058-197</td></tr><tr><td>Beds: 1</td><td>Baths: 1</td></tr><tr><td>Rent:</td><td>$1800</td></tr><tr><td>Broker Fee:</td><td>Half Month</td></tr><tr><td>Date Avail:</td><td>January 1st</td></tr><tr><td>Rent Includes:</td><td>Heat, Hot Water</td></tr><tr><td>Pet Policy:</td><td>Cat Ok</td></tr><tr><td colspan="2">on Langdon St., Cambridge - Harvard Square</td></tr></table></td><td width="50%"><table cellpadding="5" style="width:100%;"><tr><td colspan="2" style="font-weight:bold;">Apartment Features</td></tr><tr><td width="50%">- Gas Range</td><td width="50%">- HT&HW</td></tr><tr><td width="50%">- Modern Bath</td><td width="50%">- Modern Kitchen</td></tr><tr><td width="50%">- Storage - Basement</td><td width="50%"></td></tr></table></td></tr><tr><td colspan="2"></td></tr></table></font></td></tr><tr><td colspan="3"><table width="100%" border="0" cellspacing="0" cellpadding="3"><tr><td colspan="2" align="center"><b>Transportation options</b></td></tr><tr><td width="50%"><div><div><div style="text-align:center;text-decoration:underline;">Subway Lines and Stops</div><ul><li>RED - Harvard Square (11 min)</li></ul></td><td width="50%"><div style="text-align:center;text-decoration:underline;">Bus Routes and Stops</div><ul><li>74 - 
Waterhouse St & Massachusetts Ave (5 min)</li><li>72 - Waterhouse St & Massachusetts Ave (5 min)</li><li>77 - Massachusetts Ave & Waterhouse St (5 min)</li><li>75 - Waterhouse St & Massachusetts Ave (5 min)</li><li>71 - Waterhouse St & Massachusetts Ave (5 min)</li><li>And More...</li></ul></div></div></td></tr></table></td></tr><tr><td colspan="3"><div><b><font size="2">Apartment Description:</font></b></div><div style="padding:5px;"><font size="2">Recent Renovations. Great Location. Easy Walk to Harvard Law School or Harvard Square.<br>All Hardwood Floors, Kitchen w/Dining Area, Good Closet Space, Laundry Facilities.<br>(pics. of similar unit in the bldg)<br>HEAT and HOT WATER is INCLUDED in the RENT!<br>Available January 1.</font></div><br></td></tr><tr><td colspan="3"><div><strong>Similar Properties</strong></div><div>1 Bd on Huron Ave., $1835, NO FEE, Include Util., Avail Now</div><div>1 Bd on Huron Ave., $1810, Include Util., NO FEE, Avail Now</div></td></tr><tr><td colspan="2" height="2" bgcolor="#275c7d"></td></tr></table><br><table width="850" cellpadding="0" cellspacing="0" border="0"><tr><td style="text-align:center;width:50%;"><img src="http://www.newpads.info/p/2373415.jpg" width="400" border="0"></td><td style="text-align:center;width:50%;"><img src="http://www.newpads.info/p/2373416.jpg" width="400" border="0"></td></tr><tr><tr><td height="10"></td></tr><td style="text-align:center;width:50%;"><img src="http://www.newpads.info/p/2373417.jpg" width="400" border="0"></td><td style="text-align:center;width:50%;"><img src="http://www.newpads.info/p/2373418.jpg" width="400" border="0"></td></tr><tr><tr><td height="10"></td></tr><td style="text-align:center;width:50%;"><img src="http://www.newpads.info/p/2373419.jpg" width="400" border="0"></td><td style="text-align:center;width:50%;"><img src="http://www.newpads.info/p/2373420.jpg" width="400" border="0"></td></tr><tr><tr><td height="10"></td></tr></tr></table><table width="100%"><tr><td 
height="20"></td></tr><tr><td align="center"><font size="4">Contact <strong>Sugandha Singh</strong> at 781 985 4489 or [email protected].</font></td></tr></table><table width="100%" cellspacing="0" cellpadding="0"><tr><td height="25"></td></tr><tr><td align="center"><div style="font-family: Verdana, sans-serif;"><font size="0.6">Equal Housing Opportunity - Windsor Realty is not responsible for any errors or omissions. Terms, conditions and rent are subject to change without prior notice. The information gathered is from third party sources including the owner and public records and is not guaranteed.</font></div></td></tr></table></td></tr></table><img src="http://www.newpads.info/CLAD/904329.gif">

After strip_tags():

2000 Massachusetts Ave., Cambridge, MA 02140Phone: (617) 498-0011 - Fax: (617) 498-0044http://www.windsorrealty.netSugandha SinghPhone: 781 985 4489Email: [email protected] Ask Me A Question  Search My Apartments1 Bd on Concord Ave., HT/HW, Avail 01/01$1500 / MonthApartment DetailsApartment FeaturesReferral ID:WES-008-561Available:January 1stRent:$1500Bed(s):1Bath(s):1Rent Includes:Heat, Hot WaterFee:One MonthSubway Lines and StopsRED - Harvard Square (13 min)Bus Routes and Stops75 - Garden St Opp Mason St (7 min)74 - Garden St Opp Mason St (7 min)72 - Garden St Opp Mason St (7 min)78 - Concord Ave & Huron Ave (7 min)77 - Massachusetts Ave & Waterhouse St (8 min)And More...Contact Sugandha Singh at 781 985 4489 or [email protected] Housing Opportunity - Windsor Realty is not responsible for any errors or omissions. Terms, conditions and rent are subject to change without prior notice. The information gathered is from third party sources including the owner and public records and is not guaranteed.

Note: I'm using CodeIgniter, in case it has any parsing functions that would help.

Upvotes: 0

Views: 251

Answers (4)

hakre

Reputation: 197775

You can tell strip_tags() which tags to preserve. With that done, you only need to solve the problem of replacing <td> and <tr> elements with <p>.

That can be done with the DOMDocument class: run an XPath query to select the elements, then replace each one with a <p> element that takes over the original element's children.

Some related code can be found in a previous answer (Question: Extract all the text and img tags from HTML in PHP). For moving children around, see another answer (Question: How do you remove duplicate, nested DOM elements in PHP?).
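
A minimal sketch of the approach described above, not taken from the linked answers. The function names replaceElement() and cleanHtml() are made up for illustration, and the allow-list is the one from the question plus <p>. Note that strip_tags() keeps the attributes of the tags it preserves, so the sketch removes attributes in the DOM before serializing.

```php
<?php
// Hypothetical helper: replace $old with a new, attribute-free element
// named $name, moving (not copying) all of its children across.
function replaceElement(DOMElement $old, string $name): void
{
    $new = $old->ownerDocument->createElement($name);
    while ($old->firstChild !== null) {
        $new->appendChild($old->firstChild); // appendChild() moves the node
    }
    $old->parentNode->replaceChild($new, $old);
}

// Hypothetical entry point: returns the cleaned-up fragment as a string.
function cleanHtml(string $html): string
{
    if (trim($html) === '') {
        return '';
    }

    $doc = new DOMDocument();
    libxml_use_internal_errors(true);   // tolerate invalid real-world markup
    $doc->loadHTML($html);
    libxml_clear_errors();

    $body = $doc->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }

    $xpath = new DOMXPath($doc);

    // 1. Turn every <td> and <tr> into a <p>, keeping the children.
    foreach ($xpath->query('//td | //tr') as $cell) {
        replaceElement($cell, 'p');
    }

    // 2. Drop every attribute (class, id, style, width, ...).
    foreach ($xpath->query('//*') as $el) {
        while ($el->attributes->length > 0) {
            $el->removeAttributeNode($el->attributes->item(0));
        }
    }

    // 3. Serialize the body's children, then let strip_tags() remove
    //    every tag that is not on the allow-list.
    $out = '';
    foreach ($body->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return strip_tags($out, '<ul><li><ol><br><strong><p>');
}
```

Because both <tr> and <td> become <p>, a table row ends up as nested <p> elements. If that is a problem, converting only <td> (XPath '//td') is probably enough. loadHTML() also assumes ISO-8859-1 when the input declares no charset, so non-ASCII input may need extra care.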

Upvotes: 1

ajreal

Reputation: 47321

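// $html is a placeholder for the HTML string to clean up
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$nodes = $xpath->query("//*");   // every element in the document

// Collect the text content of each whitelisted element
$rtn = array();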
foreach ($nodes as $node)
{
    switch ($node->nodeName)
    {
        case "ul":
        case "li":
        case "ol":
        case "br":
        case "strong":
            $rtn[] = $node->nodeValue;
            break;
    }
}
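
The loop above collects only the text of the whitelisted elements. A possible variant, sketched here under my own assumptions, keeps the whitelisted elements themselves (without attributes) and unwraps everything else, so the children of a removed element move up to its parent. The name unwrapUnlisted() and the $keep list are illustrative and not part of the answer.

```php
<?php
// Hypothetical helper: keep only whitelisted elements, unwrap the rest.
function unwrapUnlisted(string $html): string
{
    // Same whitelist as the switch statement above
    $keep = ['ul', 'li', 'ol', 'br', 'strong'];

    if (trim($html) === '') {
        return '';
    }

    $doc = new DOMDocument();
    libxml_use_internal_errors(true);   // tolerate invalid real-world markup
    $doc->loadHTML($html);
    libxml_clear_errors();

    $body = $doc->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }

    $xpath = new DOMXPath($doc);
    // Document order: a parent is always handled before its children.
    foreach ($xpath->query('.//*', $body) as $el) {
        if (in_array($el->nodeName, $keep, true)) {
            // Whitelisted: keep the element, drop its attributes.
            while ($el->attributes->length > 0) {
                $el->removeAttributeNode($el->attributes->item(0));
            }
            continue;
        }
        // Not whitelisted: lift the children in front of the element,
        // then remove the now-empty element itself.
        while ($el->firstChild !== null) {
            $el->parentNode->insertBefore($el->firstChild, $el);
        }
        $el->parentNode->removeChild($el);
    }

    $out = '';
    foreach ($body->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}
```

Elements without children, such as <img>, simply disappear. The text inside <script> or <style> would be lifted like any other text, so those elements may need to be removed outright first.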

Upvotes: 1

mat

Reputation: 1629

Use str_replace() to replace your tags: http://php.net/manual/en/function.str-replace.php

Use strip_tags() to remove the rest, and don't forget to exclude the tags you need: http://nl2.php.net/manual/en/function.strip-tags.php

Use nl2br() for the line breaks: http://nl2.php.net/manual/en/function.nl2br.php
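
A rough sketch of how these three functions might be chained. The function name roughCleanup() is made up, and the preg_replace() step that collapses blank lines is my own addition rather than part of the answer. It turns the end of each cell and row into a newline first, so the text no longer runs together after strip_tags().

```php
<?php
// Hypothetical helper combining str_replace(), strip_tags() and nl2br().
function roughCleanup(string $html): string
{
    // 1. End of a cell or row becomes a newline (lowercase tags only).
    $html = str_replace(['</td>', '</tr>'], "\n", $html);

    // 2. Remove every tag except the ones to keep.
    $text = strip_tags($html, '<ul><li><ol><br><strong>');

    // 3. Collapse runs of blank lines and surrounding whitespace
    //    (an extra step, not mentioned in the answer).
    $text = preg_replace('/\s*\n\s*/', "\n", trim($text));

    // 4. Turn the remaining newlines into <br /> tags.
    return nl2br($text);
}

echo roughCleanup('<table><tr><td>Rent:</td><td>1800</td></tr></table>');
```

This is string-based, so it is only as reliable as the markup is regular. Also, strip_tags() leaves the attributes of the preserved tags in place, so classes and ids on <ul> or <strong> would survive.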

Upvotes: 0

Tim

Reputation: 699

phpQuery is a good option in this case. Highly recommended.

Upvotes: 0
