EnexoOnoma

Reputation: 8836

How to convert this HTML table to XML?

Here are two tables containing data I would like to have in an XML format. The real data has more than two tables, each with a varying number of rows.

<table width="100%" align="center" class="mytable" border="1" cellspacing="1">
  <tr><td width="100%"><b>Δ.Ο.Υ. Α' ΑΘΗΝΩΝ (Α',Β',Γ',ΙΕ',ΚΒ') Κ.Α.: 1101</b> Αναξαγόρα 6-8, T.K. 100 10 Αθήνα</a><a name="aa8inon"></a></td></tr>
    <tr><td width="8%">Προϊστάμενος</td><td width="8%">&nbsp;</td><td width="8%"><b>210</b>-52.72.810, 770</td></tr>
    <tr><td width="8%">Υποδιευθυντής Φορολογίας</td><td width="8%">&nbsp;</td><td width="8%"><b>210</b>-52.72.804</td></tr>
    <tr><td width="8%">Υποδιευθυντής Ελέγχου</td><td width="8%"><b>213</b> 1604121</td><td width="8%"><b>210</b>-52.72.807</td></tr>
</table>

<table width="100%" align="center" class="mytable" border="1" cellspacing="1">
  <tr><td width="100%"><b>Δ.Ο.Υ. ΚΑΤΟΙΚΩΝ ΕΞΩΤΕΡΙΚΟΥ Κ.Α.: 1125</b> Μετσόβου 4-T.K.  106 82 Αθήνα</td></tr>
    <tr><td width="8%">Προϊστάμενος</td><td width="8%"><b>213</b> 1607155</td><td width="8%"><b>210</b>- 8204607</td></tr>
    <tr><td width="8%">Υποδιευθυντής Φορολογίας</td><td width="8%">&nbsp;</td><td width="8%"><b>210</b>- 8204604</td></tr>
</table>

The first row below the table tag is the root element and all the other rows are child elements. Please forgive me if I name some of the elements incorrectly.

For example, inside the first <tr><td> you see

<b>Δ.Ο.Υ. Α' ΑΘΗΝΩΝ (Α',Β',Γ',ΙΕ',ΚΒ') Κ.Α.: 1101</b> Αναξαγόρα 6-8, T.K. 100 10 Αθήνα</a><a name="aa8inon"></a>

This would be the value of an attribute on the root element.

In each following row, the first <td></td> (for example Προϊστάμενος) gives the name of the child element, and everything from the next <td> up to the last </td> of the <tr> is the data for that child element.

This is what I would like to have

<note doy="<b>Δ.Ο.Υ. Α' ΑΘΗΝΩΝ (Α',Β',Γ',ΙΕ',ΚΒ') Κ.Α.: 1101</b> Αναξαγόρα 6-8, T.K. 100 10 Αθήνα</a><a name="aa8inon"></a>">
  <Προϊστάμενος>&nbsp;</td><td width="8%"><b>210</b>-52.72.810, 770</Προϊστάμενος>
  <Υποδιευθυντής Φορολογίας>&nbsp;</td><td width="8%"><b>210</b>-52.72.810, 770</Υποδιευθυντής Φορολογίας>
</note>

Is this possible? Any code is appreciated.

Upvotes: 1

Views: 10167

Answers (5)

flup

Reputation: 27104

You can parse valid XHTML as XML and transform it to the desired XML format using an XML stylesheet. Since your HTML isn't valid XHTML (note the stray </a> in the first row, for example), you'll have to tidy it first using a tool, for instance an online tidy site. There's a PHP library too (with sample code) if you need to do this at runtime.

I tidied your HTML on that site, and applied the following stylesheet to it:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform" 
    xmlns:xhtml="http://www.w3.org/1999/xhtml"
    xmlns:fn="http://www.w3.org/2005/xpath-functions">

    <xsl:template match="/xhtml:html/xhtml:body">
        <xsl:element name="notes">
            <xsl:apply-templates />
        </xsl:element>
    </xsl:template>

    <xsl:template match="xhtml:table">
        <xsl:element name="note">
            <xsl:attribute name="doy">
                <xsl:value-of select="xhtml:tr[1]/xhtml:td" />
            </xsl:attribute>
            <xsl:for-each select="xhtml:tr[position() != 1]">
                <xsl:element name="{translate(xhtml:td,' ','_')}">
                    <xsl:for-each select="xhtml:td[position() != 1]">
                        <!-- filter out empty / &nbsp; td elements -->
                        <xsl:if test="normalize-space(translate(.,'&#xc2;&#xa0;','  '))">
                            <xsl:element name="τηλέφωνο">
                                <xsl:value-of select="."  />
                            </xsl:element>
                        </xsl:if>
                    </xsl:for-each>
                </xsl:element>
            </xsl:for-each>
        </xsl:element>
    </xsl:template>
</xsl:stylesheet>

This yields:

<notes>
    <note
        doy="Δ.Ο.Υ. Α' ΑΘΗΝΩΝ (Α',Β',Γ',ΙΕ',ΚΒ') Κ.Α.: 1101 Αναξαγόρα 6-8, T.K. 100 10 Αθήνα">
        <Προϊστάμενος>
            <τηλέφωνο>210-52.72.810, 770</τηλέφωνο>
        </Προϊστάμενος>
        <Υποδιευθυντής_Φορολογίας>
            <τηλέφωνο>210-52.72.804</τηλέφωνο>
        </Υποδιευθυντής_Φορολογίας>
        <Υποδιευθυντής_Ελέγχου>
            <τηλέφωνο>213 1604121</τηλέφωνο>
            <τηλέφωνο>210-52.72.807</τηλέφωνο>
        </Υποδιευθυντής_Ελέγχου>
    </note>

    <note
        doy="Δ.Ο.Υ. ΚΑΤΟΙΚΩΝ ΕΞΩΤΕΡΙΚΟΥ Κ.Α.: 1125 Μετσόβου 4-T.K. 106 82 Αθήνα">
        <Προϊστάμενος>
            <τηλέφωνο>213 1607155</τηλέφωνο>
            <τηλέφωνο>210- 8204607</τηλέφωνο>
        </Προϊστάμενος>
        <Υποδιευθυντής_Φορολογίας>
            <τηλέφωνο>210- 8204604</τηλέφωνο>
        </Υποδιευθυντής_Φορολογίας>
    </note>
</notes>

Some notes:

  • It seems weird to me to expect HTML markup in the resulting XML. XML contains data, and data isn't formatted bold and doesn't contain anchors.
  • XML element names are not allowed to contain spaces, so I've replaced them with underscores.
  • You must specify UTF-8 to the tidy site or it will make gibberish of the non-ASCII characters.
  • I first tried to stick with the question's XML format, until I saw in a different answer that you want all phone numbers to appear in the XML. Therefore I've given each of them its own surrounding element.
  • XML stylesheets are language independent; there's a way to apply them in most languages. For instance in PHP (see below), in JavaScript, or even in the browser, since you can serve the .xhtml with this stylesheet and the browser will then render the XML. But it's usually done the other way around, creating an HTML representation of the XML data. I'm not sure when and where you need to create the XML.

Sample PHP code:

<?php
$xhtml_file = 'doc.xhtml';
$xsl_file = 'doc.xsl';
$doc = new DOMDocument();
$xsl = new XSLTProcessor();

$doc->load($xsl_file);
$xsl->importStyleSheet($doc);

$doc->load($xhtml_file);
echo $xsl->transformToXML($doc);
?>

Upvotes: 1

mobius

Reputation: 5174

First off I should note that the XML you want to output seems to be invalid.

You could make use of the excellent querypath library (http://querypath.org/), and eventually you could port the same logic from PHP to JavaScript (with jQuery's selector engine).

Here is a piece of code that produces valid XML from your input (btw I am Greek so it makes more sense to me):

libxml_use_internal_errors(true);

$html = '<html><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /></head><body><table width="100%" align="center" class="mytable" border="1" cellspacing="1">
           <tr><td width="100%"><b>Δ.Ο.Υ. Α\' ΑΘΗΝΩΝ (Α\',Β\',Γ\',ΙΕ\',ΚΒ\') Κ.Α.: 1101</b> Αναξαγόρα 6-8, T.K. 100 10 Αθήνα</a><a name="aa8inon"></a></td></tr>
           <tr><td width="8%">Προϊστάμενος</td><td width="8%">&nbsp;</td><td width="8%"><b>210</b>-52.72.810, 770</td></tr>
           <tr><td width="8%">Υποδιευθυντής Φορολογίας</td><td width="8%">&nbsp;</td><td width="8%"><b>210</b>-52.72.804</td></tr>
           <tr><td width="8%">Υποδιευθυντής Ελέγχου</td><td width="8%"><b>213</b> 1604121</td><td width="8%"><b>210</b>-52.72.807</td></tr>
         </table>
         <table width="100%" align="center" class="mytable" border="1" cellspacing="1">
           <tr><td width="100%"><b>Δ.Ο.Υ. ΚΑΤΟΙΚΩΝ ΕΞΩΤΕΡΙΚΟΥ Κ.Α.: 1125</b> Μετσόβου 4-T.K.  106 82 Αθήνα</td></tr>
           <tr><td width="8%">Προϊστάμενος</td><td width="8%"><b>213</b> 1607155</td><td width="8%"><b>210</b>- 8204607</td></tr>
           <tr><td width="8%">Υποδιευθυντής Φορολογίας</td><td width="8%">&nbsp;</td><td width="8%"><b>210</b>- 8204604</td></tr>
         </table></body></html>';

$results = qp($html, 'table.mytable');

$xml   = new \SimpleXMLElement('<?xml version="1.0" encoding="UTF-8"?><notes/>');

foreach( $results as $result ) {
  $note = $xml->addChild("note");

  foreach( $result->children('tr') as $idx => $tr ) {
    if( $idx == 0 ) {
      $note->addAttribute("doy", $tr->children('td')->text());
      continue;
    }

    $tds    = $tr->children('td');

    foreach( $tds as $tidx => $td ) {
      if( $tidx == 0 ) {
        $person = $note->addChild("person");
        $person->addAttribute("title", trim($td->text()));

        continue;
      }

      $phoneValue = $td->text();
      $phoneValue = str_replace( array(" ", ".", "-", "\xc2\xa0"), "", $phoneValue );

      if( $phoneValue != '' )
        $phone = $person->addChild("phone", $phoneValue);
    }
  }
}

$dom = dom_import_simplexml($xml)->ownerDocument;
$dom->formatOutput = true;
echo $dom->saveXML();

The output:

<?xml version="1.0" encoding="UTF-8"?>
  <notes>
    <note doy="Δ.Ο.Υ. Α' ΑΘΗΝΩΝ (Α',Β',Γ',ΙΕ',ΚΒ') Κ.Α.: 1101 Αναξαγόρα 6-8, T.K. 100 10 Αθήνα">
      <person title="Προϊστάμενος">
        <phone>2105272810,770</phone>
      </person>
      <person title="Υποδιευθυντής Φορολογίας">
        <phone>2105272804</phone>
      </person>
      <person title="Υποδιευθυντής Ελέγχου">
        <phone>2131604121</phone>
        <phone>2105272807</phone>
      </person>
    </note>
    <note doy="Δ.Ο.Υ. ΚΑΤΟΙΚΩΝ ΕΞΩΤΕΡΙΚΟΥ Κ.Α.: 1125 Μετσόβου 4-T.K.  106 82 Αθήνα">
      <person title="Προϊστάμενος">
        <phone>2131607155</phone>
        <phone>2108204607</phone>
      </person>
      <person title="Υποδιευθυντής Φορολογίας">
        <phone>2108204604</phone>
      </person>
    </note>
  </notes>

Please note: I've wrapped your html code in <html><head><body> tags adding the <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /> tag to help querypath identify the encoding. See https://github.com/technosophos/querypath/issues/94 if you need more information. If you insist on creating the XML you've pasted in the question you could change the sample accordingly.

Also, querypath converts &nbsp; to the bytes 0xC2 0xA0 (c2a0), which is the UTF-8 encoding of the Unicode no-break space character (http://www.fileformat.info/info/unicode/char/a0/index.htm), hence the "\xc2\xa0" in the str_replace.
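
The same phone cleanup can be sketched in JavaScript. This is only a minimal illustration, assuming the cell text has already been extracted as a string; `normalizePhone` is a hypothetical helper name, not part of querypath or any other library.

```javascript
// &nbsp; decodes to U+00A0 (no-break space); encoded as UTF-8 that is the bytes C2 A0.
const nbspBytes = Array.from(new TextEncoder().encode("\u00a0"))
  .map((b) => b.toString(16));

// Hypothetical helper: strip spaces, dots, hyphens and no-break spaces,
// mirroring the str_replace call in the PHP code above.
function normalizePhone(cellText) {
  return cellText.replace(/[ .\-\u00a0]/g, "");
}

console.log(nbspBytes.join(" "));                  // c2 a0
console.log(normalizePhone("210-52.72.810, 770")); // 2105272810,770
```

A cell that contains only a no-break space normalizes to the empty string, which is why the PHP code can skip it with a simple `!= ''` check.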

Upvotes: 8

Paolo Mioni

Reputation: 998

It can be done with some regular expressions. These will work even if the rest of your markup is malformed (but your table and td tags must be well-formed).

// your original string    
$string = <<<heredoc
    <table width="100%" align="center" class="mytable" border="1" cellspacing="1">
      <tr><td width="100%"><b>Δ.Ο.Υ. Α' ΑΘΗΝΩΝ (Α',Β',Γ',ΙΕ',ΚΒ') Κ.Α.: 1101</b> Αναξαγόρα 6-8, T.K. 100 10 Αθήνα</a><a name="aa8inon"></a></td></tr>
        <tr><td width="8%">Προϊστάμενος</td><td width="8%">&nbsp;</td><td width="8%"><b>210</b>-52.72.810, 770</td></tr>
        <tr><td width="8%">Υποδιευθυντής Φορολογίας</td><td width="8%">&nbsp;</td><td width="8%"><b>210</b>-52.72.804</td></tr>
        <tr><td width="8%">Υποδιευθυντής Ελέγχου</td><td width="8%"><b>213</b> 1604121</td><td width="8%"><b>210</b>-52.72.807</td></tr>
    </table>

    <table width="100%" align="center" class="mytable" border="1" cellspacing="1">
      <tr><td width="100%"><b>Δ.Ο.Υ. ΚΑΤΟΙΚΩΝ ΕΞΩΤΕΡΙΚΟΥ Κ.Α.: 1125</b> Μετσόβου 4-T.K.  106 82 Αθήνα</td></tr>
        <tr><td width="8%">Προϊστάμενος</td><td width="8%"><b>213</b> 1607155</td><td width="8%"><b>210</b>- 8204607</td></tr>
        <tr><td width="8%">Υποδιευθυντής Φορολογίας</td><td width="8%">&nbsp;</td><td width="8%"><b>210</b>- 8204604</td></tr>
    </table>

heredoc;

$patternTable = "/<table(.+?)table>/s"; // simple regExp for table tags
$patternTd = '/<td[^>]*>(.+?)<\/td>/s'; // simple regExp for individual tds

$xml = new SimpleXMLElement('<?xml version="1.0" encoding="UTF-8"?><root/>');



preg_match_all($patternTable, $string, $matches); 

for($i=0; $i<sizeof($matches[1]); $i++){   
    $tds  = array();
    $attribute = "";
    $tagContent = ""; // reset per table so text from the previous table cannot leak in
    $tagName = "";
    preg_match_all($patternTd,$matches[1][$i], $tds); 
    for($j=0; $j<sizeof($tds[1]); $j++){
        if($j==0){ // first TD, add as attribute of note, taking the CONTENT of the td
            $attribute =  $tds[1][$j];
            $note = $xml->addChild("note");
            $note->addAttribute("doy", $attribute);
        } else { // other tds 
            // there are 3 tds, the first is the name of the tag, the other two the contents
            if($j %3 == 1){
                if($tagName != ""){
                    $note->addChild($tagName, $tagContent);
                    $tagContent = "";
                }
                $tagName = str_replace(" ", "_", $tds[1][$j]);
            } else {
                $tagContent.= $tds[1][$j];
            } 
        }
    }
    $note->addChild($tagName, $tagContent); // add the last opened node
}

$dom = dom_import_simplexml($xml)->ownerDocument;
$dom->formatOutput = true;
echo $dom->saveXML();

The result of this script for me is:

<?xml version="1.0" encoding="UTF-8"?>
<root>
  <note doy="&lt;b&gt;Δ.Ο.Υ. Α' ΑΘΗΝΩΝ (Α',Β',Γ',ΙΕ',ΚΒ') Κ.Α.: 1101&lt;/b&gt; Αναξαγόρα 6-8, T.K. 100 10 Αθήνα&lt;/a&gt;&lt;a name=&quot;aa8inon&quot;&gt;&lt;/a&gt;">
    <Προϊστάμενος>&nbsp;&lt;b&gt;210&lt;/b&gt;-52.72.810, 770</Προϊστάμενος>
    <Υποδιευθυντής_Φορολογίας>&nbsp;&lt;b&gt;210&lt;/b&gt;-52.72.804</Υποδιευθυντής_Φορολογίας>
    <Υποδιευθυντής_Ελέγχου>&lt;b&gt;213&lt;/b&gt; 1604121&lt;b&gt;210&lt;/b&gt;-52.72.807</Υποδιευθυντής_Ελέγχου>
  </note>
  <note doy="&lt;b&gt;Δ.Ο.Υ. ΚΑΤΟΙΚΩΝ ΕΞΩΤΕΡΙΚΟΥ Κ.Α.: 1125&lt;/b&gt; Μετσόβου 4-T.K.  106 82 Αθήνα">
    <Προϊστάμενος>&lt;b&gt;213&lt;/b&gt; 1607155&lt;b&gt;210&lt;/b&gt;- 8204607</Προϊστάμενος>
    <Υποδιευθυντής_Φορολογίας>&nbsp;&lt;b&gt;210&lt;/b&gt;- 8204604</Υποδιευθυντής_Φορολογίας>
  </note>
</root>

All of the HTML in the attributes and contents of the tags is escaped, as it's not valid to have tags within content. But if you print it out again, it will preserve your content.

Keep in mind that this solution uses regular expressions and both SimpleXML and DOM (the latter only for pretty-printing the XML with newlines and indentation), so it will not be very fast. If you want to skip the DOM part, you can just use

echo $xml->asXML();

instead of

$dom = dom_import_simplexml($xml)->ownerDocument;
$dom->formatOutput = true;
echo $dom->saveXML();

Hope this helps.

Upvotes: 0

Fabio Beltramini

Reputation: 2511

Looking back, I can't tell if your question was about PHP or JavaScript, but here is an answer in JavaScript. Just save it to an HTML file and load it in a new browser window to see the output.

<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body>
<table width="100%" align="center" class="mytable" border="1" cellspacing="1">
  <tr><td width="100%"><b>Δ.Ο.Υ. Α' ΑΘΗΝΩΝ (Α',Β',Γ',ΙΕ',ΚΒ') Κ.Α.: 1101</b> Αναξαγόρα 6-8, T.K. 100 10 Αθήνα</a><a name="aa8inon"></a></td></tr>
    <tr><td width="8%">Προϊστάμενος</td><td width="8%">&nbsp;</td><td width="8%"><b>210</b>-52.72.810, 770</td></tr>
    <tr><td width="8%">Υποδιευθυντής Φορολογίας</td><td width="8%">&nbsp;</td><td width="8%"><b>210</b>-52.72.804</td></tr>
    <tr><td width="8%">Υποδιευθυντής Ελέγχου</td><td width="8%"><b>213</b> 1604121</td><td width="8%"><b>210</b>-52.72.807</td></tr>
</table>
<table width="100%" align="center" class="mytable" border="1" cellspacing="1">
  <tr><td width="100%"><b>Δ.Ο.Υ. ΚΑΤΟΙΚΩΝ ΕΞΩΤΕΡΙΚΟΥ Κ.Α.: 1125</b> Μετσόβου 4-T.K.  106 82 Αθήνα</td></tr>
    <tr><td width="8%">Προϊστάμενος</td><td width="8%"><b>213</b> 1607155</td><td width="8%"><b>210</b>- 8204607</td></tr>
    <tr><td width="8%">Υποδιευθυντής Φορολογίας</td><td width="8%">&nbsp;</td><td width="8%"><b>210</b>- 8204604</td></tr>
</table>
<textarea id="output" rows="24" cols="140"></textarea>
</body>
<script type="text/javascript">
var tables=document.getElementsByTagName("table");
var doc, note, el, elName, txt, txtContent, rows;

doc=document.implementation.createDocument("AnyNamespaceYouWantForYourXML","RootElementName"); //In older versions of IE, I believe you'll have to resort to an ActiveX object
for(var t =0; t<tables.length;t++){
    el=doc.createElement("note");
    note=doc.documentElement.appendChild(el);
    rows=tables[t].getElementsByTagName("tr");
    for(var r=0; r<rows.length; r++){
        var tds=rows[r].getElementsByTagName("td");
        if(r==0){
            note.setAttribute("doy",tds[0].innerHTML); //Unlike in your example output, the real output will have 'special' characters correctly html encoded
        } else {
            elName=tds[0].textContent; //textContent is the standard DOM property; innerText is not available in every browser
            elName=elName.trim(); //You probably want to discard leading or trailing whitespace
            elName=elName.replace(/[\s]+/g,"_"); //XML element names cannot contain spaces, so replace with underscores
            //There are other rules relating to valid XML element names which you may need to add here. Greek letters should be fine.
            el=doc.createElement(elName);
            //It wasn't clear from your example whether you wanted the xml element to contain the text of the html or some text and a td element
            //The first case seemed more likely, so here it is
            txtContent=" </td>";
            for(var d=1;d<tds.length;d++){
                txtContent+=tds[d].outerHTML;
            }
            txt=doc.createTextNode(txtContent);
            el.appendChild(txt); //Put the text in the element
            note.appendChild(el); //Add the element to the note
        }
    }
}
console.log(doc); //Check the console, you have a useful XML document object
document.getElementById("output").value=xml2Str(doc.documentElement); //Output a string representation


function xml2Str(xmlNode) {
  try {
      // Pretty printing available?
      return XML((new XMLSerializer()).serializeToString(xmlNode)).toXMLString();
  }
  catch (e) {}
  try {
      // Gecko- and Webkit-based browsers (Firefox, Chrome), Opera.
      return (new XMLSerializer()).serializeToString(xmlNode).replace(/<([^\/])/g,"\n<$1");
  }
  catch (e) {}
  try {
     // Internet Explorer.
     return xmlNode.xml.replace(/<([^\/])/g,"\n<$1");
  }
  catch (e) {}  
  //Other browsers without XML Serializer
  alert('Xmlserializer not supported');
  return false;
}
</script>
</html>

Sample output (indentation added by hand):

<RootElementName xmlns="AnyNamespaceYouWantForYourXML">
<note doy="&lt;b&gt;Δ.Ο.Υ. Α' ΑΘΗΝΩΝ (Α',Β',Γ',ΙΕ',ΚΒ') Κ.Α.: 1101&lt;/b&gt; Αναξαγόρα 6-8, T.K. 100 10 Αθήνα&lt;a name=&quot;aa8inon&quot;&gt;&lt;/a&gt;">
  <Προϊστάμενος> &lt;/td&gt;&lt;td width="8%"&gt;&amp;nbsp;&lt;/td&gt;&lt;td width="8%"&gt;&lt;b&gt;210&lt;/b&gt;-52.72.810, 770&lt;/td&gt;</Προϊστάμενος>
  <Υποδιευθυντής_Φορολογίας> &lt;/td&gt;&lt;td width="8%"&gt;&amp;nbsp;&lt;/td&gt;&lt;td width="8%"&gt;&lt;b&gt;210&lt;/b&gt;-52.72.804&lt;/td&gt;</Υποδιευθυντής_Φορολογίας>
  <Υποδιευθυντής_Ελέγχου> &lt;/td&gt;&lt;td width="8%"&gt;&lt;b&gt;213&lt;/b&gt; 1604121&lt;/td&gt;&lt;td width="8%"&gt;&lt;b&gt;210&lt;/b&gt;-52.72.807&lt;/td&gt;</Υποδιευθυντής_Ελέγχου>
</note>
<note doy="&lt;b&gt;Δ.Ο.Υ. ΚΑΤΟΙΚΩΝ ΕΞΩΤΕΡΙΚΟΥ Κ.Α.: 1125&lt;/b&gt; Μετσόβου 4-T.K.  106 82 Αθήνα">
  <Προϊστάμενος> &lt;/td&gt;&lt;td width="8%"&gt;&lt;b&gt;213&lt;/b&gt; 1607155&lt;/td&gt;&lt;td width="8%"&gt;&lt;b&gt;210&lt;/b&gt;- 8204607&lt;/td&gt;</Προϊστάμενος>
  <Υποδιευθυντής_Φορολογίας> &lt;/td&gt;&lt;td width="8%"&gt;&amp;nbsp;&lt;/td&gt;&lt;td width="8%"&gt;&lt;b&gt;210&lt;/b&gt;- 8204604&lt;/td&gt;</Υποδιευθυντής_Φορολογίας>
</note>
</RootElementName>

[Edit] Things to note:

  1. Your example output was confusing in that it contained unmatched tags (e.g. markup inside your doy attribute and inside your Greek-named elements). I interpreted your sample output as best I could and converted everything within the attribute and within the Greek-named elements to text. That means special characters are entity-encoded where needed, e.g. < as &lt; and, inside the attribute, " as &quot;. Another possibility, for element content only, is to surround the markup with <![CDATA[ ... ]]> to tell the XML parser not to interpret the characters in that area.
  2. While you can name XML elements with Greek characters, note that not all characters are valid in XML element names, so you will need some control over what text can appear inside that first cell, or you will have to correct invalid characters explicitly in your code. See http://www.w3schools.com/xml/xml_elements.asp
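
As a rough illustration of the second point, here is a minimal sketch of a name sanitizer. `toXmlName` is a hypothetical helper, and the character classes it uses only approximate the XML Name rules; it is not a full implementation of the specification.

```javascript
// Hypothetical helper: turn arbitrary cell text into something usable as an
// XML element name. Approximation only: keeps Unicode letters, ASCII digits,
// "_", "." and "-", and makes sure the first character is a letter or "_".
function toXmlName(label) {
  let name = label.trim().replace(/\s+/g, "_"); // whitespace runs become underscores
  name = name.replace(/[^\p{L}0-9_.\-]/gu, ""); // drop everything else
  if (!/^[\p{L}_]/u.test(name)) {
    name = "_" + name;                          // names cannot start with a digit, "." or "-"
  }
  return name;
}

console.log(toXmlName("Υποδιευθυντής Φορολογίας")); // Υποδιευθυντής_Φορολογίας
console.log(toXmlName(" 1st <row> "));              // _1st_row
```

Note that two different labels can collapse to the same name after sanitizing, so this is only safe when the labels are known to be reasonably clean.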

Upvotes: 2

mico

Reputation: 12738

I am not a PHP coder myself, so I apologize for any errors. I used [1] as a reference and quickly adapted the answer there to come close to what you asked for:

Code as a rough idea:

<?php

  # Create new DOM object
  $domOb = new DOMDocument();

  # Ignore whitespace-only text nodes (set before loading)
  $domOb->preserveWhiteSpace = false;

  # Load your HTML file (returns true on success)
  $domOb->loadHTMLFile('sections.html');

  # Set the container tag
  $container = $domOb->getElementsByTagName('table');

  # Loop through the tables
  foreach ($container as $table)
  {
      # Grab all <td> elements of this table
      $items = $table->getElementsByTagName('td');
  }

?>

Evolution to fully answer the question:

With that, taken almost directly from that source [1], $container has all the tables and $items has the <td> element contents.

I suppose you know some PHP, so it should not be hard to do the following from here (only pseudocode, sorry):

1) Take one table item from `$container` with that `foreach`
2) Take the first td item, write the opening of the needed XML tag `<note doy="`
3) Print the td content there
4) Close the tag with `">`
5) Print the rest of the rows, adding the <td> tags manually to the sides (I suppose this code removes them)
6) Add the trailing `</note>` tag and move on to the next item in `$container`

Sorry, my PHP skills are close to zero, so try to manage with these steps. If somebody else can improve this, feel free to use my beginning as a starting point and post a new answer. I just want to help @Kaoukkos; I don't want any points if another person can give a more complete answer than I can.

What is needed is to iterate in some way other than a plain foreach over all the cells, so that you can apply steps 2-4 to the first row and step 5 to the remaining rows. And that's it, folks!
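
The steps above can be sketched in JavaScript. This is a minimal illustration, not a finished converter: it assumes the cell text has already been pulled out of the HTML into nested arrays (tables, then rows, then cells), and the names `escapeXml` and `buildNotes` are hypothetical. To keep the output well-formed it puts each row label in a `title` attribute, as another answer in this thread does, rather than re-inserting the `<td>` markup from step 5.

```javascript
// Escape the five characters that are special in XML text and attribute values.
function escapeXml(text) {
  const map = { "&": "&amp;", "<": "&lt;", ">": "&gt;", '"': "&quot;", "'": "&apos;" };
  return text.replace(/[&<>"']/g, (ch) => map[ch]);
}

// tables: array of tables; each table is an array of rows;
// each row is an array of cell strings.
function buildNotes(tables) {
  const notes = tables.map(([header, ...rows]) => {
    // Step 5: every row after the first becomes a child element.
    const children = rows.map(([label, ...values]) => {
      const phones = values
        .map((v) => v.trim())                 // trim() also removes no-break spaces
        .filter((v) => v !== "")              // skip empty / &nbsp;-only cells
        .map((v) => `      <phone>${escapeXml(v)}</phone>`);
      return [`    <person title="${escapeXml(label.trim())}">`, ...phones, "    </person>"].join("\n");
    });
    // Steps 2-4 and 6: the first row becomes the doy attribute of <note>.
    return [`  <note doy="${escapeXml(header[0].trim())}">`, ...children, "  </note>"].join("\n");
  });
  return ["<notes>", ...notes, "</notes>"].join("\n");
}

// Usage with data shaped like the second table in the question:
console.log(buildNotes([
  [
    ["Δ.Ο.Υ. ΚΑΤΟΙΚΩΝ ΕΞΩΤΕΡΙΚΟΥ Κ.Α.: 1125 Μετσόβου 4-T.K. 106 82 Αθήνα"],
    ["Προϊστάμενος", "213 1607155", "210- 8204607"],
    ["Υποδιευθυντής Φορολογίας", "\u00a0", "210- 8204604"],
  ],
]));
```

Destructuring each table into `[header, ...rows]` is what lets the first row be treated differently from the rest without any index bookkeeping.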

My sources:

[1] Generating XML from HTML list using PHP

Upvotes: 0
