Shantanu Paul

Reputation: 705

Adding missing XML closing tags in JavaScript

I need to parse external files with the below structure using Node.js.

<ISSUER>
<COMPANY-DATA>
<CONFORMED-NAME>EXACTECH INC
<CIK>000012345
<ASSIGNED-SIC>9999
<IRS-NUMBER>8979898988
<STATE-OF-INCORPORATION>FL
<FISCAL-YEAR-END>1231
</COMPANY-DATA>
<BUSINESS-ADDRESS>
<STREET1>22W 56TH COURT
<CITY>GAINSVILLE
<STATE>FL
<ZIP>32653
<PHONE>999-999-9999
</BUSINESS-ADDRESS>
<MAIL-ADDRESS>
<STREET1>22W 56TH COURT
<CITY>GAINSVILLE
<STATE>FL
<ZIP>32653
</MAIL-ADDRESS>
</ISSUER>

The blocks have closing tags but individual lines do not. How can I add the missing closing tags so that I can parse the XML?

I do not have control over how the XML files are generated, so I cannot get this fixed at the source.

This is similar to this Java implementation: Parsing XML with no closing tags in Java

Upvotes: 1

Views: 1112

Answers (1)

imhotap

Reputation: 2500

Your data looks like SGML, the superset of XML that allows tag inference/omission. I'm in the process of releasing an SGML parser for JavaScript (for the browser, Node.js and other CommonJS platforms), but it's not released yet. For the time being, I suggest using the venerable OpenSP software, which doesn't have an npm integration package, but which you can easily install on e.g. Ubuntu/Debian using sudo apt-get install opensp, and similarly on other Linux distributions and on Mac OS via MacPorts.

The OpenSP package contains the osx command line utility to down-convert SGML to XML. You can use the node child_process core package to invoke the osx program, pipe it your SGML data, and grab the XML output produced by it, and then feed the produced XML to the XML parser of your choice in your node app.
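The child_process step described above might be sketched as follows. This is only a sketch under assumptions: runFilter is a hypothetical helper name, osx is assumed to be on the PATH, and osx is assumed to read the document from standard input when no file argument is given. OpenSP may also exit with a non-zero status when it reports warnings, in which case the error handling would need to be relaxed.

```javascript
const { spawn } = require('child_process');

// Hypothetical helper: run an external program, write `input` to its
// stdin, and resolve with everything it prints on stdout.
function runFilter(command, args, input) {
  return new Promise((resolve, reject) => {
    const child = spawn(command, args, { stdio: ['pipe', 'pipe', 'pipe'] });
    let stdout = '';
    let stderr = '';
    child.stdout.setEncoding('utf8');
    child.stderr.setEncoding('utf8');
    child.stdout.on('data', (chunk) => { stdout += chunk; });
    child.stderr.on('data', (chunk) => { stderr += chunk; });
    // Emitted when the program cannot be started (e.g. not installed).
    child.on('error', reject);
    child.on('close', (code) => {
      if (code === 0) resolve(stdout);
      else reject(new Error(`${command} exited with code ${code}: ${stderr}`));
    });
    // Ignore write errors if the child exits before reading all input.
    child.stdin.on('error', () => {});
    child.stdin.end(input);
  });
}

// Assumed usage (osx invocation without arguments is an assumption):
// runFilter('osx', [], doctypeText + sgmlText)
//   .then((xml) => { /* hand xml to the XML parser of your choice */ });
```

The command name is a parameter so that the same helper can be pointed at a different converter without changes.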

SGML and the osx program must be told to add the omitted end-element tags for CONFORMED-NAME, CIK, and the other elements whose end-element tags are missing. You do that by prepending a document type declaration (DOCTYPE) containing the DTD to your SGML content. In your case, what you supply to the osx program should look as follows:

<!DOCTYPE ISSUER [
  <!ELEMENT ISSUER - -
    (COMPANY-DATA,BUSINESS-ADDRESS,MAIL-ADDRESS)>
  <!ELEMENT COMPANY-DATA - -
    (CONFORMED-NAME,CIK,ASSIGNED-SIC,IRS-NUMBER,
    STATE-OF-INCORPORATION,FISCAL-YEAR-END)>
  <!ELEMENT (BUSINESS-ADDRESS,MAIL-ADDRESS) - -
    (STREET1,CITY,STATE,ZIP,PHONE?)>
  <!ELEMENT
    (CONFORMED-NAME,CIK,ASSIGNED-SIC,IRS-NUMBER,
    STATE-OF-INCORPORATION,FISCAL-YEAR-END,
    STREET1,CITY,STATE,ZIP,PHONE) - O (#PCDATA)>
]>
<ISSUER> ... rest of your input data following here

Crucially, the declaration for CONFORMED-NAME, CIK, and the other field-like elements uses - O (hyphen-minus and letter O) as tag omission indicators, telling SGML that the end-element tags for these elements can be omitted; the osx program will then insert them automatically.
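To illustrate what inserting the omitted end tags amounts to for this particular input, here is a naive line-based stand-in in plain JavaScript. It is not an SGML parser and not the DTD-driven approach described in this answer: it assumes every field sits on its own line in the form <TAG>value, and closeLeafTags is a hypothetical name.

```javascript
// Naive stand-in (NOT SGML tag inference): append an end tag to every line
// of the form "<TAG>value". Lines holding only a start tag or an end tag
// (the block-level elements) are left untouched.
function closeLeafTags(text) {
  return text
    .split(/\r?\n/)
    .map((line) => {
      const match = /^(\s*)<([A-Za-z][\w.-]*)>\s*(\S.*)$/.exec(line);
      if (!match) return line;
      const [, indent, tag, rawValue] = match;
      // Leave lines alone that already carry their own end tag.
      if (rawValue.includes(`</${tag}>`)) return line;
      // Escape characters that would make the resulting XML ill-formed.
      const value = rawValue.trim().replace(/&/g, '&amp;').replace(/</g, '&lt;');
      return `${indent}<${tag}>${value}</${tag}>`;
    })
    .join('\n');
}

// Example:
// closeLeafTags('<STATE>FL') returns '<STATE>FL</STATE>'
```

Unlike the DTD, this knows nothing about which elements are allowed where, so it breaks as soon as a value spans several lines or two fields share a line.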

You can read more about the meaning of these declarations on my project page at https://sgmljs.net/docs/sgmlrefman.html .

Update: With the above mentioned SGML package for Node.js having been released for many years now, and with @yumba expressing interest in it, I've slightly updated the DOCTYPE declarations and added a declaration for the PHONE element. I've also verified the example is parsed as expected. Note though it's strongly recommended to use the official DOCTYPE declarations for your data format (if you have one) rather than the ones I made up based on the given (necessarily very limited) example data.

Anyway, to make this parse on Node.js, install Node.js and the Node.js sgml package, e.g. by running

npm install -g sgml

and then

sgmlproc test.sgm

on the command line, where test.sgm contains the above SGML text. sgmlproc will output XML by default on the standard output so actually it's not necessary to give any command line options, but you might want to check the sgmlproc command line reference to see what's available.
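Since the question asks about doing this from a Node.js app, the same command line could presumably be driven from code. A minimal sketch, assuming sgmlproc is on the PATH after the global install and writes the XML to standard output as described; runConverter is a hypothetical name, and on Windows the npm-installed launcher may need a shell to start.

```javascript
const { execFile } = require('child_process');
const { promisify } = require('util');

const execFileAsync = promisify(execFile);

// Hypothetical helper: run a converter program with the given arguments
// and resolve with its standard output. Rejects if the program cannot be
// started or exits with a non-zero status.
async function runConverter(command, args) {
  const { stdout } = await execFileAsync(command, args, {
    maxBuffer: 16 * 1024 * 1024, // allow larger documents than the default
  });
  return stdout;
}

// Assumed usage, mirroring the command line above:
// runConverter('sgmlproc', ['test.sgm'])
//   .then((xml) => { /* feed xml to an XML parser */ });
```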

Upvotes: 2
