Sergio

Reputation: 668

Java SAXParser: ignore exception and continue parsing

I have a Java class that parses an XML file and writes its content to MySQL. Everything works fine, except that when the XML file contains invalid Unicode characters, an exception is thrown and the program stops parsing the file.

My provider sends this XML file daily with a list of products and their prices, quantities, etc. I have no control over it, so invalid characters will always be there.

All I'm trying to do is catch these errors, ignore them, and continue parsing the rest of the XML file.

I've added try-catch statements to the startElement, endElement and characters methods of the SAXHandler class. However, they don't catch any exception, and execution stops whenever the parser finds an invalid character.

It seems that I can only catch these exceptions in the function that calls the parser:

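    try {
        myIS = new FileInputStream(xmlFilePath);
        parser.parse(myIS, handler);
        retValue = true; // reached only if the whole document was parsed
    } catch (SAXParseException err) {
        // By the time this runs, the parser has already stopped.
        System.out.println("SAXParseException at line " + err.getLineNumber()
                + ", column " + err.getColumnNumber() + ": " + err.getMessage());
    }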

However, that's useless in my case. Even though the exception tells me where the invalid character is, execution stops, so the list of products is far from complete. The list has about 8,000 products and only a couple of invalid characters, but if an invalid character falls within the first 100 products, the remaining 7,900 products are not updated in the database. I've also noticed that the endDocument method is not called if an exception occurs.

Somebody asked the same question here some years ago, but didn't get a solution.

I'd really appreciate any ideas or workarounds for this.

Data Sample (as requested):

    <Producto>
     <Brand>
      <Description>Epson</Description>
      <ManufacturerId>eps</ManufacturerId>
      <BrandId>eps</BrandId>
     </Brand>
     <New>false</New>
     <OnSale>null</OnSale>
     <Type>Physical</Type>
     <Description>Epson TM T88V - Impresora de recibos - línea térmica - rollo 8 cm - hasta 300 mm/segundo - paralelo, USB</Description>
     <Category>
      <CategoryId>pos</CategoryId>
      <Description>Puntos de Venta</Description>
      <Subcategories>
       <CategoryId>pos.printer</CategoryId>
       <Description>Impresoras para Recibos</Description>
      </Subcategories>
     </Category>
     <InStock>0</InStock>
     <Price>
      <UnitPrice>4865.6042</UnitPrice>
      <CurrencyId>MXN</CurrencyId>
     </Price>
     <Manufacturer>
      <Description>Epson</Description>
      <ManufacturerId>eps</ManufacturerId>
     </Manufacturer>
     <Mpn>C31CA85814</Mpn>
     <Sku>PT910EPS27</Sku>
     <CompilationDate>2020-02-25T12:30:14.6607135Z</CompilationDate>
    </Producto>

Upvotes: 2

Views: 1177

Answers (2)

Sergio

Reputation: 668

I solved it by removing the invalid characters from the XML file before processing it.

I couldn't do what I was trying to do (catch the error and continue), but this workaround worked.
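
A minimal sketch of such a pre-cleaning step could look like the following. It is not necessarily the exact code used here: it assumes the file is UTF-8 and small enough to read into memory, and the names XmlSanitizer, stripInvalidXmlChars and parseSanitized are made up for illustration. It drops every code point that the XML 1.0 specification does not allow as a character, then hands the cleaned text to the SAX parser.

```java
import java.io.IOException;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class; the names are illustrative only.
final class XmlSanitizer {

    /** True if the code point matches the XML 1.0 "Char" production. */
    static boolean isValidXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Returns a copy of the text with all disallowed code points removed. */
    static String stripInvalidXmlChars(String text) {
        StringBuilder out = new StringBuilder(text.length());
        text.codePoints()
            .filter(XmlSanitizer::isValidXmlChar)
            .forEach(out::appendCodePoint);
        return out.toString();
    }

    /** Reads the file (assumed UTF-8), cleans it, and parses the result. */
    static void parseSanitized(String xmlFilePath, DefaultHandler handler)
            throws IOException, SAXException, ParserConfigurationException {
        byte[] bytes = Files.readAllBytes(Paths.get(xmlFilePath));
        String clean = stripInvalidXmlChars(new String(bytes, StandardCharsets.UTF_8));
        // A leading byte order mark is a legal character but can confuse the
        // parser when the input is a character stream, so drop it.
        if (clean.startsWith("\uFEFF")) {
            clean = clean.substring(1);
        }
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(clean)), handler);
    }
}
```

With this, the call to parser.parse(myIS, handler) in the question would become something like XmlSanitizer.parseSanitized(xmlFilePath, handler). Note that this only removes literal characters; a numeric character reference to a disallowed character (for example `&#x1;`) would still make the parser fail.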

Upvotes: 1

Michael Kay

Reputation: 163322

The XML philosophy is that you don't process bad data. If it's not well-formed XML, the parser is supposed to give up, and user applications are supposed to give up. Culturally, this is a reaction against the HTML culture, where it was found that if it's generally expected that data users will tolerate bad data, the consequence is that suppliers will produce bad data.

Standards deliver cost reduction because you can use readily available off-the-shelf tools both for creating valid data and for reading it at the other end. The benefits are totally neutralised if you decide you're going to interchange things that are almost XML but not quite. If you were downloading software you wouldn't put up with it if it didn't compile. So why are you prepared to put up with bad data? Send it back and demand a refund.

Having said that, if the problem is "invalid Unicode characters" then it's possible that it started out as good XML and got corrupted in transit. Find out what went wrong and get it fixed as close to the source of the problem as you can.
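
As a starting point for that investigation, a small diagnostic along these lines could list where the disallowed characters are, so the positions can be reported to the supplier. This is only a sketch: the class and method names are hypothetical, and it assumes the file content has already been decoded into a String.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical diagnostic helper; the names are illustrative only.
final class InvalidXmlCharReport {

    /** True if the code point matches the XML 1.0 "Char" production. */
    static boolean isValidXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Lists each disallowed code point with its 1-based line and column. */
    static List<String> find(String text) {
        List<String> problems = new ArrayList<>();
        int line = 1;
        int column = 1;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (!isValidXmlChar(cp)) {
                problems.add(String.format("line %d, column %d: U+%04X", line, column, cp));
            }
            // Columns are counted in code points; a newline starts a new line.
            if (cp == '\n') {
                line++;
                column = 1;
            } else {
                column++;
            }
            i += Character.charCount(cp);
        }
        return problems;
    }
}
```

Comparing the reported positions across several daily files may show whether the same fields are always affected, which helps narrow down where the corruption is introduced.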

Upvotes: 1
