Reputation: 705
I need to parse external files with the below structure using Node.js.
<ISSUER>
<COMPANY-DATA>
<CONFORMED-NAME>EXACTECH INC
<CIK>000012345
<ASSIGNED-SIC>9999
<IRS-NUMBER>8979898988
<STATE-OF-INCORPORATION>FL
<FISCAL-YEAR-END>1231
</COMPANY-DATA>
<BUSINESS-ADDRESS>
<STREET1>22W 56TH COURT
<CITY>GAINSVILLE
<STATE>FL
<ZIP>32653
<PHONE>999-999-9999
</BUSINESS-ADDRESS>
<MAIL-ADDRESS>
<STREET1>22W 56TH COURT
<CITY>GAINSVILLE
<STATE>FL
<ZIP>32653
</MAIL-ADDRESS>
</ISSUER>
The blocks have closing tags but individual lines do not. How can I add the missing closing tags so that I can parse the XML?
I do not have control over the XML file generation so cannot get it fixed at source.
This is similar to this Java implementation: "Parsing XML with no closing tags in Java".
Upvotes: 1
Views: 1112
Reputation: 2500
Your data looks like SGML, the superset of XML that allows tag inference/omission. I'm in the process of releasing an SGML parser for JavaScript (for the browser, Node.js, and other CommonJS platforms), but it's not released yet. For the time being, I suggest using the venerable OpenSP software, which doesn't have an npm integration package but which you can easily install on e.g. Ubuntu/Debian using sudo apt-get install opensp, and similarly on other Linux distributions and on Mac OS via MacPorts.
The OpenSP package contains the osx command-line utility to down-convert SGML to XML. You can use Node's child_process core module to invoke the osx program, pipe your SGML data to it, capture the XML it produces, and then feed that XML to the XML parser of your choice in your Node app.
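As a rough sketch of that child_process step: the helper below pipes a string into an external converter and returns whatever it prints on standard output. The function name sgmlToXml and its parameters are made up for illustration. It passes no flags to osx, and it assumes osx reads from standard input when no file argument is given, so check that against your OpenSP installation.

```javascript
const { spawnSync } = require('child_process');

// Pipe SGML text into a converter command and return what it prints on stdout.
// `command` defaults to OpenSP's `osx`; `args` is empty because no flags are
// assumed here. Assumption: osx reads stdin when called without a file argument.
function sgmlToXml(sgmlText, command = 'osx', args = []) {
  const result = spawnSync(command, args, { input: sgmlText, encoding: 'utf8' });

  if (result.error) {
    // Typically ENOENT: the program is not installed or not on PATH.
    throw result.error;
  }
  if (result.stderr) {
    // Surface the converter's diagnostics without treating them as fatal.
    console.error(result.stderr);
  }
  if (!result.stdout) {
    throw new Error(`${command} produced no output (exit status ${result.status})`);
  }
  return result.stdout;
}

module.exports = { sgmlToXml };
```

The returned string can then be handed to whichever XML parser you already use. spawnSync blocks until the converter exits, which keeps the sketch short; for large files or a server process, the asynchronous spawn with streams is probably the better fit.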
SGML and the osx program must be told to add the omitted end-element tags for CONFORMED-NAME, CIK, and the other elements with omitted end-element tags. You do that by prepending a document type declaration (DTD) to your SGML content. In your case, what you supply to the osx program should look as follows:
<!DOCTYPE ISSUER [
<!ELEMENT ISSUER - -
(COMPANY-DATA,BUSINESS-ADDRESS,MAIL-ADDRESS)>
<!ELEMENT COMPANY-DATA - -
(CONFORMED-NAME,CIK,ASSIGNED-SIC,IRS-NUMBER,
STATE-OF-INCORPORATION,FISCAL-YEAR-END)>
<!ELEMENT (BUSINESS-ADDRESS,MAIL-ADDRESS) - -
(STREET1,CITY,STATE,ZIP,PHONE?)>
<!ELEMENT
(CONFORMED-NAME,CIK,ASSIGNED-SIC,IRS-NUMBER,
STATE-OF-INCORPORATION,FISCAL-YEAR-END,
STREET1,CITY,STATE,ZIP,PHONE) - O (#PCDATA)>
]>
<ISSUER> ... rest of your input data following here
Crucially, the declarations for CONFORMED-NAME, CIK, and the other field-like elements use - O (hyphen-minus and letter O) as tag omission indicators, telling SGML that the end-element tags for these elements can be omitted and will be inserted automatically by the osx program.
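Since the input files arrive without a DOCTYPE, the prepending has to happen in your own code before the data reaches the converter. A minimal sketch, assuming the declarations above fit your files; the names ISSUER_DOCTYPE and withDoctype are hypothetical and only serve the illustration:

```javascript
// The DOCTYPE from this answer as a JS string. It is based on the sample data
// only; swap in the official declarations for your format if you have them.
const ISSUER_DOCTYPE = `<!DOCTYPE ISSUER [
<!ELEMENT ISSUER - -
(COMPANY-DATA,BUSINESS-ADDRESS,MAIL-ADDRESS)>
<!ELEMENT COMPANY-DATA - -
(CONFORMED-NAME,CIK,ASSIGNED-SIC,IRS-NUMBER,
STATE-OF-INCORPORATION,FISCAL-YEAR-END)>
<!ELEMENT (BUSINESS-ADDRESS,MAIL-ADDRESS) - -
(STREET1,CITY,STATE,ZIP,PHONE?)>
<!ELEMENT
(CONFORMED-NAME,CIK,ASSIGNED-SIC,IRS-NUMBER,
STATE-OF-INCORPORATION,FISCAL-YEAR-END,
STREET1,CITY,STATE,ZIP,PHONE) - O (#PCDATA)>
]>`;

// Put the DOCTYPE in front of the raw file content, dropping any leading
// whitespace so the document element follows the declaration directly.
function withDoctype(rawSgml, doctype = ISSUER_DOCTYPE) {
  return `${doctype}\n${rawSgml.trimStart()}`;
}

module.exports = { ISSUER_DOCTYPE, withDoctype };
```

The result of withDoctype(fileContents) is what you would pass to the SGML processor in place of the original file content.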
You can read more about the meaning of these declarations on my project page at https://sgmljs.net/docs/sgmlrefman.html.
Update: With the above-mentioned SGML package for Node.js having been released for many years now, and with @yumba expressing interest in it, I've slightly updated the DOCTYPE declarations and added a declaration for the PHONE element. I've also verified that the example is parsed as expected. Note, though, that it's strongly recommended to use the official DOCTYPE declarations for your data format (if you have one) rather than the ones I made up based on the given (necessarily very limited) example data.
Anyway, to make this parse on Node.js, install Node.js and the Node.js sgml package, e.g. by invoking
npm install -g sgml
and then
sgmlproc test.sgm
on the command line, where test.sgm contains the above SGML text. sgmlproc writes XML to standard output by default, so it's not actually necessary to give any command-line options, but you might want to check the sgmlproc command line reference to see what's available.
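If you want to run this from inside a Node.js program instead of a shell, one option is to call the command-line tool with a file argument, exactly as on the command line above. The sketch below writes the SGML text to a temporary file and captures the tool's standard output. The function name convertWithCli is hypothetical, and only the plain sgmlproc <file> invocation shown above is assumed; the package's own JavaScript API, if you prefer it, is not covered here.

```javascript
const { execFileSync } = require('child_process');
const fs = require('fs');
const os = require('os');
const path = require('path');

// Write the SGML text to a temporary file, run `command [...args] <file>`,
// and return the tool's stdout. `command` defaults to sgmlproc, called with
// the file name only, as in the command line shown above.
function convertWithCli(sgmlText, command = 'sgmlproc', args = []) {
  const dir = fs.mkdtempSync(path.join(os.tmpdir(), 'sgml-'));
  const file = path.join(dir, 'input.sgm');
  try {
    fs.writeFileSync(file, sgmlText, 'utf8');
    // execFileSync throws if the command is missing or exits with a non-zero status.
    return execFileSync(command, [...args, file], { encoding: 'utf8' });
  } finally {
    // Remove the temporary directory whether or not the conversion succeeded.
    fs.rmSync(dir, { recursive: true, force: true });
  }
}

module.exports = { convertWithCli };
```

On Windows, globally installed npm commands are usually .cmd wrappers, which execFileSync cannot start without a shell, so treat this as a starting point for Linux and macOS.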
Upvotes: 2