Shiva Krishna Bavandla

Reputation: 26668

How to ignore &nbsp; and special characters in XML tags before giving the XML file to the parser

Hi, I am currently using xml.sax.handler to parse XML files.

Below is my file.xml:

<?xml version="1.0" encoding="utf-8"?>
<sturp>
  <gear>
   <UL>
   <LI><I>Free Private Housing or a Generous Housing Allowance</I></LI>
   <LI><I>$50K in Free Life Insurance coverage</I></LI>
   </UL>
   <P style="MARGIN: 0in 0in 0pt" class="MsoNormal"><FONT size="3"><SPAN style="FONT-FAMILY: Symbol; COLOR: black; mso-ascii-font-family: 'Times New Roman'">�</SPAN><SPAN style="COLOR: black"><FONT face="Times New Roman"><SPAN style="mso-spacerun: yes">&nbsp; </SPAN>Position will manage 24 ED Rooms with 24/7 accountability<o:p></o:p></FONT></SPAN></FONT></P>
   <DIV>&nbsp;</DIV>
  </gear> 
</sturp>

Below is my code:

import xml.sax

xmlFilePath = 'user/documents/file.xml'

try:
    parser = xml.sax.make_parser()
    # FeedHandler, conn, mapping and the other arguments are not shown here
    handler = FeedHandler(conn, clientSiteId, clientId, documentElementName, jobElementName)
    handler.setMapping(mapping)
    parser.setContentHandler(handler)
    parser.setEntityResolver(handler)

    parser.parse(open(xmlFilePath))

except xml.sax.SAXParseException as e:
    print("*** PARSER error: %s" % e)

Output:

*** PARSER error: user/documents/file.xml:8:150: not well-formed <invalid token>
*** PARSER error: user/documents/file.xml:9:1: not well-formed <invalid token>

The source XML file given to me is not valid XML, but I still need to parse it. How can I ignore &nbsp; and � in the XML file (and also get past any other errors and invalid XML tokens) before feeding it to the parser in the code above?

Thanks in advance.

Upvotes: 1

Views: 2537

Answers (3)

pat34515

Reputation: 1979

If you're simply looking to strip &[a-z]+; entities from your input, you could use my hacked-up solution below. Note, though, that you should still give the parser a valid XML file if you want it to work correctly.

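import io, re

For the parser:

def ignore_open(p):
    # Read the raw bytes, strip every &name; entity reference, and hand the
    # parser an in-memory file object (no temporary file to clean up).
    # Note: this also strips valid references such as &amp; and &lt;.
    with open(p, 'rb') as src:
        cleaned = re.sub(br'&[A-Za-z]+;', b'', src.read())
    return io.BytesIO(cleaned)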

Result

>>> parser.parse(ignore_open(xmlFilePath))

Untested code.
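
A slightly more careful variant of the same idea, as a rough Python 3 sketch rather than a tested fix. Instead of deleting every entity reference, it keeps the five references XML predefines, rewrites known HTML names such as &nbsp; as numeric character references, and drops bytes that are not valid UTF-8 (presumably what shows up as � in the question). The names clean_xml_bytes and parse_cleaned are made up for this illustration, and it assumes the file is meant to be UTF-8, as its declaration says.

```python
import re
import xml.sax
from html.entities import name2codepoint

# The five entity names XML predefines; references to these stay untouched.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}
ENTITY_RE = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")
# Control characters and noncharacters that XML 1.0 does not allow.
INVALID_XML_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\ufffe\uffff]")


def _fix_entity(match):
    name = match.group(1)
    if name in XML_PREDEFINED:
        return match.group(0)                      # valid XML already
    if name in name2codepoint:
        return "&#%d;" % name2codepoint[name]      # e.g. &nbsp; -> &#160;
    return ""                                      # unknown name: drop it


def clean_xml_bytes(raw):
    """Hypothetical helper: return a best-effort cleaned copy of raw bytes."""
    # Assumption: input is UTF-8; undecodable bytes are silently dropped.
    text = raw.decode("utf-8", errors="ignore")
    text = INVALID_XML_CHARS.sub("", text)
    text = ENTITY_RE.sub(_fix_entity, text)
    return text.encode("utf-8")


def parse_cleaned(path, handler):
    """Hypothetical helper: clean the file at path, then SAX-parse it."""
    with open(path, "rb") as src:
        xml.sax.parseString(clean_xml_bytes(src.read()), handler)
```

It would be called as parse_cleaned(xmlFilePath, handler) in place of parser.parse(open(xmlFilePath)). Note that xml.sax.parseString creates its own parser, so the setEntityResolver call from the question is not carried over, and other well-formedness problems (unclosed tags, for instance) would still make the parser stop.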

Upvotes: 2

Michael Kay

Reputation: 163322

You say you are parsing XML files, but you are wrong. You are parsing non-XML files. XML parsers are designed to parse XML, and if you give them non-XML they will rightly complain.

If you want your system to handle messages in a non-XML format, then the first thing to do is drop all mention of XML from your system description and all thoughts of using XML tools to do the parsing. You don't have to use XML in your system, but there is absolutely no point in using something that is almost-XML-but-not-quite.

The alternative is to change the program that is generating these messages so that it produces proper well-formed XML.

Upvotes: 2

Has QUIT--Anony-Mousse

Reputation: 77454

XML mostly makes sense when your files are valid.

This is not a valid XML file, and your parser is correct to stop. For example, entities such as &nbsp; must be declared before they can be used; XML itself predefines only &amp;, &lt;, &gt;, &quot; and &apos;. So your file should have a document type declaration. That is not just for show: the document type is what actually defines entities and the like.
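
To illustrate what "defined" means here, a small Python 3 sketch. The sample document is made up, but it declares nbsp in an internal DTD subset, and with that declaration the stock xml.sax parser should accept &nbsp; without complaint. TextCollector is a hypothetical helper, not part of xml.sax.

```python
import xml.sax


class TextCollector(xml.sax.ContentHandler):
    """Hypothetical handler that just gathers character data."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        self.chunks.append(content)


# Made-up document: the DOCTYPE's internal subset declares the nbsp entity.
doc = b"""<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE sturp [
  <!ENTITY nbsp "&#160;">
]>
<sturp><gear><DIV>&nbsp;</DIV></gear></sturp>"""

handler = TextCollector()
xml.sax.parseString(doc, handler)   # no SAXParseException for &nbsp; here
text = "".join(handler.chunks)      # the expanded entity: a no-break space
```

This only helps if you control how the file is produced, since the declaration has to be part of the document itself.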

If you want a best-effort, robust, and tolerant parser, I recommend looking at BeautifulSoup. It can parse most HTML and XML-like files without requiring everything to be fully defined. The input is still not valid XML, but this is usable in situations where, for example, users mangle your data files.
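
BeautifulSoup is a third-party package, so as a stand-in that runs with only the standard library, here is a Python 3 sketch of the same tolerant style using html.parser. This is not BeautifulSoup's API, and TextCollector and the trimmed sample string are assumptions made for illustration.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical helper: records tag names and non-blank text."""

    def __init__(self):
        # convert_charrefs=True turns references like &nbsp; into characters.
        super().__init__(convert_charrefs=True)
        self.open_tags = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)      # html.parser lowercases tag names

    def handle_data(self, data):
        if data.strip():                # skip whitespace-only runs
            self.text.append(data.strip())


# A shortened version of the markup from the question.
sample = (
    "<gear><LI><I>$50K in Free Life Insurance coverage</I></LI>"
    "<DIV>&nbsp;</DIV>"
    "<P>24/7 accountability<o:p></o:p></P></gear>"
)

collector = TextCollector()
collector.feed(sample)
collector.close()
```

Neither the undeclared &nbsp; nor the o:p tag stops this parser, which is the point of a tolerant tool. The trade-off is that it treats the input as HTML-like tag soup: tag names are lowercased and nothing checks that elements are properly nested.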

Removing characters from a file is a HACK and bound to break sooner or later. I cannot recommend doing this.

Upvotes: 1
