Reputation: 8836
Here are two tables containing data I would like to have in an XML format. The real data has more than two tables, each with a varying number of rows.
<table width="100%" align="center" class="mytable" border="1" cellspacing="1">
<tr><td width="100%"><b>Δ.Ο.Υ. Α' ΑΘΗΝΩΝ (Α',Β',Γ',ΙΕ',ΚΒ') Κ.Α.: 1101</b> Αναξαγόρα 6-8, T.K. 100 10 Αθήνα</a><a name="aa8inon"></a></td></tr>
<tr><td width="8%">Προϊστάμενος</td><td width="8%"> </td><td width="8%"><b>210</b>-52.72.810, 770</td></tr>
<tr><td width="8%">Υποδιευθυντής Φορολογίας</td><td width="8%"> </td><td width="8%"><b>210</b>-52.72.804</td></tr>
<tr><td width="8%">Υποδιευθυντής Ελέγχου</td><td width="8%"><b>213</b> 1604121</td><td width="8%"><b>210</b>-52.72.807</td></tr>
</table>
<table width="100%" align="center" class="mytable" border="1" cellspacing="1">
<tr><td width="100%"><b>Δ.Ο.Υ. ΚΑΤΟΙΚΩΝ ΕΞΩΤΕΡΙΚΟΥ Κ.Α.: 1125</b> Μετσόβου 4-T.K. 106 82 Αθήνα</td></tr>
<tr><td width="8%">Προϊστάμενος</td><td width="8%"><b>213</b> 1607155</td><td width="8%"><b>210</b>- 8204607</td></tr>
<tr><td width="8%">Υποδιευθυντής Φορολογίας</td><td width="8%"> </td><td width="8%"><b>210</b>- 8204604</td></tr>
</table>
The first row below the table tag is the root element and all the other rows are child elements. Please forgive me if I name some of the elements incorrectly.
For example, between the first <tr><td> you see
<b>Δ.Ο.Υ. Α' ΑΘΗΝΩΝ (Α',Β',Γ',ΙΕ',ΚΒ') Κ.Α.: 1101</b> Αναξαγόρα 6-8, T.K. 100 10 Αθήνα</a><a name="aa8inon"></a>
This would be the value of an attribute on the root element.
The first <td></td> of each following row (for example Προϊστάμενος) is the child element, and everything from the next <td> to the last </td> of that <tr> is the data for this child element.
This is what I would like to have
<note doy="<b>Δ.Ο.Υ. Α' ΑΘΗΝΩΝ (Α',Β',Γ',ΙΕ',ΚΒ') Κ.Α.: 1101</b> Αναξαγόρα 6-8, T.K. 100 10 Αθήνα</a><a name="aa8inon"></a>">
<Προϊστάμενος> </td><td width="8%"><b>210</b>-52.72.810, 770</Προϊστάμενος>
<Υποδιευθυντής Φορολογίας> </td><td width="8%"><b>210</b>-52.72.810, 770</Υποδιευθυντής Φορολογίας>
</note>
Is this possible? Any code is appreciated.
Upvotes: 1
Views: 10167
Reputation: 27104
You can parse valid XHTML as XML and transform it into the desired XML format with an XSLT stylesheet. Since your HTML isn't valid XHTML, you'll have to tidy it first with a tool, for instance an online tidy site. There's a PHP library too (with sample code) should you need to do this at runtime.
I tidied your HTML on that site, and applied the following stylesheet to it:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:xhtml="http://www.w3.org/1999/xhtml"
xmlns:fn="http://www.w3.org/2005/xpath-functions">
<xsl:template match="/xhtml:html/xhtml:body">
<xsl:element name="notes">
<xsl:apply-templates />
</xsl:element>
</xsl:template>
<xsl:template match="xhtml:table">
<xsl:element name="note">
<xsl:attribute name="doy">
<xsl:value-of select="xhtml:tr[1]/xhtml:td" />
</xsl:attribute>
<xsl:for-each select="xhtml:tr[position() != 1]">
<xsl:element name="{translate(xhtml:td,' ','_')}">
<xsl:for-each select="xhtml:td[position() != 1]">
<!-- filter out td elements that are empty or contain only a no-break space -->
<xsl:if test="normalize-space(translate(., '&#160;', ' '))">
<xsl:element name="τηλέφωνο">
<xsl:value-of select="." />
</xsl:element>
</xsl:if>
</xsl:for-each>
</xsl:element>
</xsl:for-each>
</xsl:element>
</xsl:template>
</xsl:stylesheet>
This yields:
<notes>
<note
doy="Δ.Ο.Υ. Α' ΑΘΗΝΩΝ (Α',Β',Γ',ΙΕ',ΚΒ') Κ.Α.: 1101 Αναξαγόρα 6-8, T.K. 100 10 Αθήνα">
<Προϊστάμενος>
<τηλέφωνο>210-52.72.810, 770</τηλέφωνο>
</Προϊστάμενος>
<Υποδιευθυντής_Φορολογίας>
<τηλέφωνο>210-52.72.804</τηλέφωνο>
</Υποδιευθυντής_Φορολογίας>
<Υποδιευθυντής_Ελέγχου>
<τηλέφωνο>213 1604121</τηλέφωνο>
<τηλέφωνο>210-52.72.807</τηλέφωνο>
</Υποδιευθυντής_Ελέγχου>
</note>
<note
doy="Δ.Ο.Υ. ΚΑΤΟΙΚΩΝ ΕΞΩΤΕΡΙΚΟΥ Κ.Α.: 1125 Μετσόβου 4-T.K. 106 82 Αθήνα">
<Προϊστάμενος>
<τηλέφωνο>213 1607155</τηλέφωνο>
<τηλέφωνο>210- 8204607</τηλέφωνο>
</Προϊστάμενος>
<Υποδιευθυντής_Φορολογίας>
<τηλέφωνο>210- 8204604</τηλέφωνο>
</Υποδιευθυντής_Φορολογίας>
</note>
</notes>
Some notes:
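Sample PHP code:
<?php
$xhtml_file = 'doc.xhtml';
$xsl_file = 'doc.xsl';
// Load the stylesheet and hand it to the XSLT processor
$xsl_doc = new DOMDocument();
$xsl_doc->load($xsl_file);
$xsl = new XSLTProcessor();
$xsl->importStylesheet($xsl_doc);
// Load the tidied XHTML and print the transformed XML
$doc = new DOMDocument();
$doc->load($xhtml_file);
echo $xsl->transformToXML($doc);
?>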
Upvotes: 1
Reputation: 5174
First off I should note that the XML you want to output seems to be invalid.
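To illustrate why: XML element names may not contain spaces, and a raw < is not allowed inside an attribute value. Below is a minimal JavaScript sketch of the name check. isPlausibleXmlName and toElementName are hypothetical helpers, and the pattern is only a rough approximation of the XML Name rule, not the full production from the spec.

```javascript
// Rough approximation of the XML "Name" rule (an assumption for illustration,
// not the complete production): a letter or underscore, followed by letters,
// combining marks, digits, '.', '-' or '_'.
function isPlausibleXmlName(name) {
  return /^[\p{L}_][\p{L}\p{M}\p{N}._-]*$/u.test(name);
}

// Replace whitespace runs with underscores so a label can serve as a name.
function toElementName(label) {
  return label.trim().replace(/\s+/g, "_");
}

console.log(isPlausibleXmlName("Υποδιευθυντής Φορολογίας"));                // false: contains a space
console.log(isPlausibleXmlName(toElementName("Υποδιευθυντής Φορολογίας"))); // true
```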
You could make use of the excellent querypath library (http://querypath.org/), and eventually you could port the same logic from PHP to JavaScript (with jQuery's selector engine).
Here is a piece of code that produces valid XML from your input (btw I am Greek so it makes more sense to me):
libxml_use_internal_errors(true);
$html = '<html><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /></head><body><table width="100%" align="center" class="mytable" border="1" cellspacing="1">
<tr><td width="100%"><b>Δ.Ο.Υ. Α\' ΑΘΗΝΩΝ (Α\',Β\',Γ\',ΙΕ\',ΚΒ\') Κ.Α.: 1101</b> Αναξαγόρα 6-8, T.K. 100 10 Αθήνα</a><a name="aa8inon"></a></td></tr>
<tr><td width="8%">Προϊστάμενος</td><td width="8%"> </td><td width="8%"><b>210</b>-52.72.810, 770</td></tr>
<tr><td width="8%">Υποδιευθυντής Φορολογίας</td><td width="8%"> </td><td width="8%"><b>210</b>-52.72.804</td></tr>
<tr><td width="8%">Υποδιευθυντής Ελέγχου</td><td width="8%"><b>213</b> 1604121</td><td width="8%"><b>210</b>-52.72.807</td></tr>
</table>
<table width="100%" align="center" class="mytable" border="1" cellspacing="1">
<tr><td width="100%"><b>Δ.Ο.Υ. ΚΑΤΟΙΚΩΝ ΕΞΩΤΕΡΙΚΟΥ Κ.Α.: 1125</b> Μετσόβου 4-T.K. 106 82 Αθήνα</td></tr>
<tr><td width="8%">Προϊστάμενος</td><td width="8%"><b>213</b> 1607155</td><td width="8%"><b>210</b>- 8204607</td></tr>
<tr><td width="8%">Υποδιευθυντής Φορολογίας</td><td width="8%"> </td><td width="8%"><b>210</b>- 8204604</td></tr>
</table></body></html>';
$results = qp($html, 'table.mytable');
$xml = new \SimpleXMLElement('<?xml version="1.0" encoding="UTF-8"?><notes/>');
foreach( $results as $result ) {
$note = $xml->addChild("note");
foreach( $result->children('tr') as $idx => $tr ) {
if( $idx == 0 ) {
$note->addAttribute("doy", $tr->children('td')->text());
continue;
}
$tds = $tr->children('td');
foreach( $tds as $tidx => $td ) {
if( $tidx == 0 ) {
$person = $note->addChild("person");
$person->addAttribute("title", trim($td->text()));
continue;
}
$phoneValue = $td->text();
$phoneValue = str_replace( array(" ", ".", "-", "\xc2\xa0"), "", $phoneValue );
if( $phoneValue != '' )
$phone = $person->addChild("phone", $phoneValue);
}
}
}
$dom = dom_import_simplexml($xml)->ownerDocument;
$dom->formatOutput = true;
echo $dom->saveXML();
The output:
<?xml version="1.0" encoding="UTF-8"?>
<notes>
<note doy="Δ.Ο.Υ. Α' ΑΘΗΝΩΝ (Α',Β',Γ',ΙΕ',ΚΒ') Κ.Α.: 1101 Αναξαγόρα 6-8, T.K. 100 10 Αθήνα">
<person title="Προϊστάμενος">
<phone>2105272810,770</phone>
</person>
<person title="Υποδιευθυντής Φορολογίας">
<phone>2105272804</phone>
</person>
<person title="Υποδιευθυντής Ελέγχου">
<phone>2131604121</phone>
<phone>2105272807</phone>
</person>
</note>
<note doy="Δ.Ο.Υ. ΚΑΤΟΙΚΩΝ ΕΞΩΤΕΡΙΚΟΥ Κ.Α.: 1125 Μετσόβου 4-T.K. 106 82 Αθήνα">
<person title="Προϊστάμενος">
<phone>2131607155</phone>
<phone>2108204607</phone>
</person>
<person title="Υποδιευθυντής Φορολογίας">
<phone>2108204604</phone>
</person>
</note>
</notes>
Please note: I've wrapped your HTML code in <html><head><body> tags and added the <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /> tag to help querypath identify the encoding. See https://github.com/technosophos/querypath/issues/94 if you need more information.
If you insist on creating the XML you've pasted in the question you could change the sample accordingly.
Also, querypath converts &nbsp; to the bytes 0xC2 0xA0 (c2a0), which is the UTF-8 encoding of the Unicode no-break space character (http://www.fileformat.info/info/unicode/char/a0/index.htm), hence the "\xc2\xa0" in the str_replace call.
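The byte pair can be checked directly. Here is a small sketch in JavaScript (the code in this answer is PHP; normalizePhone is a hypothetical helper that mirrors the str_replace call): U+00A0 encodes to c2 a0 in UTF-8, and stripping it along with spaces, dots and hyphens leaves the digits-only values shown in the output.

```javascript
// UTF-8 bytes of the no-break space U+00A0 (the character &nbsp; decodes to).
const bytes = Array.from(new TextEncoder().encode("\u00a0"))
  .map((b) => b.toString(16))
  .join(" ");
console.log(bytes); // c2 a0

// Mirror of the str_replace call: drop spaces, dots, hyphens and no-break spaces.
function normalizePhone(text) {
  return text.replace(/[ .\-\u00a0]/g, "");
}

console.log(normalizePhone("210-52.72.810, 770")); // 2105272810,770
console.log(normalizePhone("\u00a0") === "");      // true: an &nbsp;-only cell becomes empty
```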
Upvotes: 8
Reputation: 998
It can be done with some regular expressions. These will work even if your code is not properly formatted (but your table and td tags must be properly formatted).
// your original string
$string = <<<heredoc
<table width="100%" align="center" class="mytable" border="1" cellspacing="1">
<tr><td width="100%"><b>Δ.Ο.Υ. Α' ΑΘΗΝΩΝ (Α',Β',Γ',ΙΕ',ΚΒ') Κ.Α.: 1101</b> Αναξαγόρα 6-8, T.K. 100 10 Αθήνα</a><a name="aa8inon"></a></td></tr>
<tr><td width="8%">Προϊστάμενος</td><td width="8%"> </td><td width="8%"><b>210</b>-52.72.810, 770</td></tr>
<tr><td width="8%">Υποδιευθυντής Φορολογίας</td><td width="8%"> </td><td width="8%"><b>210</b>-52.72.804</td></tr>
<tr><td width="8%">Υποδιευθυντής Ελέγχου</td><td width="8%"><b>213</b> 1604121</td><td width="8%"><b>210</b>-52.72.807</td></tr>
</table>
<table width="100%" align="center" class="mytable" border="1" cellspacing="1">
<tr><td width="100%"><b>Δ.Ο.Υ. ΚΑΤΟΙΚΩΝ ΕΞΩΤΕΡΙΚΟΥ Κ.Α.: 1125</b> Μετσόβου 4-T.K. 106 82 Αθήνα</td></tr>
<tr><td width="8%">Προϊστάμενος</td><td width="8%"><b>213</b> 1607155</td><td width="8%"><b>210</b>- 8204607</td></tr>
<tr><td width="8%">Υποδιευθυντής Φορολογίας</td><td width="8%"> </td><td width="8%"><b>210</b>- 8204604</td></tr>
</table>
heredoc;
$patternTable = "/<table(.+?)table>/s"; // simple regExp for table tags
$patternTd = '/<td[^>]*>(.+?)<\/td>/s'; // simple regExp for individual tds
$xml = new SimpleXMLElement('<?xml version="1.0" encoding="UTF-8"?><root/>');
preg_match_all($patternTable, $string, $matches);
for($i=0; $i<sizeof($matches[1]); $i++){
$tds = array();
$attribute = "";
$tagContent = ""; // reset for every table so content cannot leak into the next one
$tagName = "";
preg_match_all($patternTd,$matches[1][$i], $tds);
for($j=0; $j<sizeof($tds[1]); $j++){
if($j==0){ // first TD, add as attribute of note, taking the CONTENT of the td
$attribute = $tds[1][$j];
$note = $xml->addChild("note");
$note->addAttribute("doy", $attribute);
} else { // other tds
// there are 3 tds, the first is the name of the tag, the other two the contents
if($j %3 == 1){
if($tagName != ""){
$note->addChild($tagName, $tagContent);
$tagContent = "";
}
$tagName = str_replace(" ", "_", $tds[1][$j]);
} else {
$tagContent.= $tds[1][$j];
}
}
}
$note->addChild($tagName, $tagContent); // add the last opened node
}
$dom = dom_import_simplexml($xml)->ownerDocument;
$dom->formatOutput = true;
echo $dom->saveXML();
The result of this script for me is:
<?xml version="1.0" encoding="UTF-8"?>
<root>
<note doy="<b>Δ.Ο.Υ. Α' ΑΘΗΝΩΝ (Α',Β',Γ',ΙΕ',ΚΒ') Κ.Α.: 1101</b> Αναξαγόρα 6-8, T.K. 100 10 Αθήνα</a><a name="aa8inon"></a>">
<Προϊστάμενος> <b>210</b>-52.72.810, 770</Προϊστάμενος>
<Υποδιευθυντής_Φορολογίας> <b>210</b>-52.72.804</Υποδιευθυντής_Φορολογίας>
<Υποδιευθυντής_Ελέγχου><b>213</b> 1604121<b>210</b>-52.72.807</Υποδιευθυντής_Ελέγχου>
</note>
<note doy="<b>Δ.Ο.Υ. ΚΑΤΟΙΚΩΝ ΕΞΩΤΕΡΙΚΟΥ Κ.Α.: 1125</b> Μετσόβου 4-T.K. 106 82 Αθήνα">
<Προϊστάμενος><b>213</b> 1607155<b>210</b>- 8204607</Προϊστάμενος>
<Υποδιευθυντής_Φορολογίας> <b>210</b>- 8204604</Υποδιευθυντής_Φορολογίας>
</note>
</root>
All of the HTML in the attributes and contents of the tags is escaped, as it's not valid to have tags within content. But if you print it out again, it will preserve your content.
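As a sketch of that round trip, written in JavaScript (escapeXml and unescapeXml are hypothetical helpers, not part of SimpleXML): markup placed in an attribute or text node is stored as entities, and decoding it gives back the original string.

```javascript
// Encode the characters that have special meaning in XML.
function escapeXml(s) {
  return s
    .replace(/&/g, "&amp;") // must run first so later entities are not double-escaped
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Reverse of escapeXml; "&amp;" is decoded last for the same reason.
function unescapeXml(s) {
  return s
    .replace(/&lt;/g, "<")
    .replace(/&gt;/g, ">")
    .replace(/&quot;/g, '"')
    .replace(/&amp;/g, "&");
}

const cell = "<b>210</b>-52.72.804";
const stored = escapeXml(cell);
console.log(stored);                       // &lt;b&gt;210&lt;/b&gt;-52.72.804
console.log(unescapeXml(stored) === cell); // true
```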
Keep in mind that this solution uses Regular Expressions and both SimpleXML and Dom (for the pretty printing of XML with newlines and indentations) - it will not be very fast in terms of performance. If you want to skip the Dom part, you can just use
echo $xml->asXML();
instead of
$dom = dom_import_simplexml($xml)->ownerDocument;
$dom->formatOutput = true;
echo $dom->saveXML();
Hope this helps.
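One fragile point in the script above is the $j % 3 test, which assumes every row has exactly three cells. A possible variation, sketched here in JavaScript rather than PHP, matches rows first and then the cells inside each row. parseTables is a hypothetical name, and the patterns carry the same caveat as before: the table, tr and td tags must be properly formed.

```javascript
// Split HTML into tables, rows and cells with simple regular expressions.
// Returns one object per table: the first cell of the first row is the header,
// the remaining rows are kept as arrays of raw cell contents.
function parseTables(html) {
  const tableRe = /<table\b[^>]*>([\s\S]*?)<\/table>/g;
  const rowRe = /<tr\b[^>]*>([\s\S]*?)<\/tr>/g;
  const cellRe = /<td\b[^>]*>([\s\S]*?)<\/td>/g;
  const tables = [];
  for (const t of html.matchAll(tableRe)) {
    const rows = [];
    for (const r of t[1].matchAll(rowRe)) {
      rows.push(Array.from(r[1].matchAll(cellRe), (c) => c[1]));
    }
    if (rows.length === 0) continue; // skip tables without any rows
    tables.push({ header: rows[0][0], rows: rows.slice(1) });
  }
  return tables;
}

// Made-up sample with rows of different lengths.
const sample =
  '<table><tr><td width="100%"><b>A</b> addr</td></tr>' +
  '<tr><td>Head</td><td>&nbsp;</td><td><b>210</b>-1</td></tr>' +
  '<tr><td>Deputy</td><td>213 2</td></tr></table>';
console.log(JSON.stringify(parseTables(sample), null, 2));
```

Because each row keeps its own list of cells, a row with two or four cells no longer shifts the tag names of the rows that follow it.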
Upvotes: 0
Reputation: 2511
Looking back, I can't tell if your question was for php or javascript, but here is an answer in Javascript. Just save it to an HTML file and load it in a new browser window to see the output.
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body>
<table width="100%" align="center" class="mytable" border="1" cellspacing="1">
<tr><td width="100%"><b>Δ.Ο.Υ. Α' ΑΘΗΝΩΝ (Α',Β',Γ',ΙΕ',ΚΒ') Κ.Α.: 1101</b> Αναξαγόρα 6-8, T.K. 100 10 Αθήνα</a><a name="aa8inon"></a></td></tr>
<tr><td width="8%">Προϊστάμενος</td><td width="8%"> </td><td width="8%"><b>210</b>-52.72.810, 770</td></tr>
<tr><td width="8%">Υποδιευθυντής Φορολογίας</td><td width="8%"> </td><td width="8%"><b>210</b>-52.72.804</td></tr>
<tr><td width="8%">Υποδιευθυντής Ελέγχου</td><td width="8%"><b>213</b> 1604121</td><td width="8%"><b>210</b>-52.72.807</td></tr>
</table>
<table width="100%" align="center" class="mytable" border="1" cellspacing="1">
<tr><td width="100%"><b>Δ.Ο.Υ. ΚΑΤΟΙΚΩΝ ΕΞΩΤΕΡΙΚΟΥ Κ.Α.: 1125</b> Μετσόβου 4-T.K. 106 82 Αθήνα</td></tr>
<tr><td width="8%">Προϊστάμενος</td><td width="8%"><b>213</b> 1607155</td><td width="8%"><b>210</b>- 8204607</td></tr>
<tr><td width="8%">Υποδιευθυντής Φορολογίας</td><td width="8%"> </td><td width="8%"><b>210</b>- 8204604</td></tr>
</table>
<textarea id="output" rows="24" cols="140"></textarea>
</body>
<script type="text/javascript">
var tables=document.getElementsByTagName("table");
var doc, note, el, elName, txt, txtContent, rows; //declare rows too, so it is not an implicit global
doc=document.implementation.createDocument("AnyNamespaceYouWantForYourXML","RootElementName"); //In older versions of IE, I believe you'll have to resort to an ActiveX object
for(var t =0; t<tables.length;t++){
el=doc.createElement("note");
note=doc.documentElement.appendChild(el);
rows=tables[t].getElementsByTagName("tr");
for(var r=0; r<rows.length; r++){
var tds=rows[r].getElementsByTagName("td");
if(r==0){
note.setAttribute("doy",tds[0].innerHTML); //Unlike in your example output, the real output will have 'special' characters correctly html encoded
} else {
elName=tds[0].innerText;
elName=elName.trim(); //You probably want to discard leading or trailing whitespace
elName=elName.replace(/[\s]+/g,"_"); //XML element names cannot contain spaces, so replace with underscores
//There are other rules relating to valid XML element names which you may need to add here. Greek letters should be fine.
el=doc.createElement(elName);
//It wasn't clear from your example whether you wanted the xml element to contain the text of the html or some text and a td element
//The first case seemed more likely, so here it is
txtContent=" </td>";
for(var d=1;d<tds.length;d++){
txtContent+=tds[d].outerHTML;
}
txt=doc.createTextNode(txtContent);
el.appendChild(txt); //Put the text in the element
note.appendChild(el); //Add the element to the note
}
}
}
console.log(doc); //Check the console, you have a useful XML document object
document.getElementById("output").value=xml2Str(doc.documentElement); //Output a string representation
function xml2Str(xmlNode) {
try {
// Pretty printing available?
return XML((new XMLSerializer()).serializeToString(xmlNode)).toXMLString();
}
catch (e) {}
try {
// Gecko- and Webkit-based browsers (Firefox, Chrome), Opera.
return (new XMLSerializer()).serializeToString(xmlNode).replace(/<([^\/])/g,"\n<$1");
}
catch (e) {}
try {
// Internet Explorer.
return xmlNode.xml.replace(/<([^\/])/g,"\n<$1"); //JavaScript uses $1, not \1, in replacement strings
}
catch (e) {}
//Other browsers without XML Serializer
alert('Xmlserializer not supported');
return false;
}
</script>
</html>
Sample output (indentation added by hand):
<RootElementName xmlns="AnyNamespaceYouWantForYourXML">
<note doy="<b>Δ.Ο.Υ. Α' ΑΘΗΝΩΝ (Α',Β',Γ',ΙΕ',ΚΒ') Κ.Α.: 1101</b> Αναξαγόρα 6-8, T.K. 100 10 Αθήνα<a name="aa8inon"></a>">
<Προϊστάμενος> </td><td width="8%">&nbsp;</td><td width="8%"><b>210</b>-52.72.810, 770</td></Προϊστάμενος>
<Υποδιευθυντής_Φορολογίας> </td><td width="8%">&nbsp;</td><td width="8%"><b>210</b>-52.72.804</td></Υποδιευθυντής_Φορολογίας>
<Υποδιευθυντής_Ελέγχου> </td><td width="8%"><b>213</b> 1604121</td><td width="8%"><b>210</b>-52.72.807</td></Υποδιευθυντής_Ελέγχου>
</note>
<note doy="<b>Δ.Ο.Υ. ΚΑΤΟΙΚΩΝ ΕΞΩΤΕΡΙΚΟΥ Κ.Α.: 1125</b> Μετσόβου 4-T.K. 106 82 Αθήνα">
<Προϊστάμενος> </td><td width="8%"><b>213</b> 1607155</td><td width="8%"><b>210</b>- 8204607</td></Προϊστάμενος>
<Υποδιευθυντής_Φορολογίας> </td><td width="8%">&nbsp;</td><td width="8%"><b>210</b>- 8204604</td></Υποδιευθυντής_Φορολογίας>
</note>
</RootElementName>
[Edit] Things to note:
Upvotes: 2
Reputation: 12738
I am not a PHP coder myself, so I apologize for any errors. I used [1] as a reference and quickly adapted the answer there to come close to what you asked for:
Code as a rough idea:
<?php
# Create new DOM object
$domOb = new DOMDocument();
# Grab your HTML file
$html = $domOb->loadHTMLFile('sections.html');
# Remove whitespace
$domOb->preserveWhiteSpace = false;
# Set the container tag
$container = $domOb->getElementsByTagName('table');
# Loop through the tables
foreach ($container as $table)
{
# Grab all <td> elements of this table
$items = $table->getElementsByTagName('td');
}
?>
Evolution to fully answer the question:
With that, taken almost directly from that source [1], $container holds all the tables and $items holds the <td> elements of the current table.
I suppose you know some PHP, so the following should not be hard to do (only pseudocode here, sorry):
1) Take one table item from `$container` with that `foreach`
2) Take first td item, write the needed xml tag `<note doy="`
3) Print td content there
4) Close tag `">`
5) Print the rest of the rows, adding the <td> tags manually to the sides (I suppose this code removes them)
6) Add trailing `</note>` tag and iterate to next one on `$container`
Sorry, my PHP skills are close to zero. Try to manage with these, or if somebody else can improve on this, feel free to use my beginning as a source and write a new answer. I just want to help @Kaoukkos, and I don't want any points if I cannot give the most complete answer and another person can.
What is needed is to not treat every row the same inside the foreach, but to apply steps 2-4 to the first row and step 5 to the remaining rows, and that's it, folks!
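The numbered steps above could be sketched as follows. This is JavaScript rather than PHP, it works on rows that have already been extracted as arrays of cell text, and buildNote and escapeXml are hypothetical helpers. Joining the remaining cells with a space is just one possible reading of step 5.

```javascript
// Escape text for use in XML attribute values and element content.
function escapeXml(s) {
  return s.replace(/&/g, "&amp;").replace(/</g, "&lt;")
          .replace(/>/g, "&gt;").replace(/"/g, "&quot;");
}

// rows: array of rows, each an array of cell strings.
// Steps 2-4 apply to the first row only, step 5 to every remaining row.
function buildNote(rows) {
  const [first, ...rest] = rows;
  const lines = ['<note doy="' + escapeXml(first[0]) + '">']; // steps 2-4
  for (const [label, ...cells] of rest) {                      // step 5
    const name = label.trim().replace(/\s+/g, "_");            // no spaces in element names
    const data = escapeXml(cells.join(" ").trim());
    lines.push("  <" + name + ">" + data + "</" + name + ">");
  }
  lines.push("</note>");                                       // step 6
  return lines.join("\n");
}

console.log(buildNote([
  ["Δ.Ο.Υ. ΚΑΤΟΙΚΩΝ ΕΞΩΤΕΡΙΚΟΥ Κ.Α.: 1125"],
  ["Προϊστάμενος", "213 1607155", "210- 8204607"],
  ["Υποδιευθυντής Φορολογίας", "", "210- 8204604"],
]));
```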
My sources:
[1] Generating XML from HTML list using PHP
Upvotes: 0