Reputation: 2135

Multibyte Character - Pattern matching

I am reading a Shift-JIS encoded XML file and storing it in a ByteBuffer. I then convert it into a String and use Pattern and Matcher to find the start and the end of the document. From these two positions I try to write the buffer to a file. This works when there are no multibyte characters. If there is a multibyte character, I lose some text at the end, because the value of end is slightly off.

static final Pattern startPattern = Pattern.compile("<\\?xml ");
static final Pattern endPattern = Pattern.compile("</doc>\n");

 public static void main(String[] args) throws Exception {
    File f = new File("20121114000606JA.xml");
    FileInputStream fis = new FileInputStream(f);
    FileChannel fci = fis.getChannel();
    ByteBuffer data_buffer = ByteBuffer.allocate(65536);
    while (true) {
      int read = fci.read(data_buffer);
      if (read == -1)
        break;
    }

    ByteBuffer cbytes = data_buffer.duplicate();
    cbytes.flip();
    Charset data_charset = Charset.forName("UTF-8");
    String request = data_charset.decode(cbytes).toString();

    Matcher start = startPattern.matcher(request);
    if (start.find()) {
      Matcher end = endPattern.matcher(request);

      if (end.find()) {

        int i0 = start.start();
        int i1 = end.end();

        String str = request.substring(i0, i1);

        String filename = "test.xml";
        FileChannel fc = new FileOutputStream(new File(filename), false).getChannel();

        data_buffer.position(i0);
        data_buffer.limit(i1 - i0);

        long offset = fc.position();
        long sz = fc.write(data_buffer);

        fc.close();
      }
    }
    System.out.println("OK");
  }

Upvotes: 0

Answers (3)

Joop Eggen

Reputation: 109547

Using the String indices i0 and i1 for byte positions in:

data_buffer.position(i0);
data_buffer.limit(i1 - i0);

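is erroneous. String indices count chars, not bytes, so they only coincide with byte positions when every character is encoded as a single byte. In addition, Unicode text does not have a unique representation (ĉ can also be written as two characters, c followed by a combining circumflex), so translating back and forth between chars and bytes is not only expensive but also error-prone (in rare cases of specific data).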

BufferedWriter out = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(
        new File(filename)), "UTF-8"));

Or use a CharBuffer, which implements a CharSequence.

Instead of writing to the FileChannel fc:

BufferedWriter out = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(
        new File(filename)), "UTF-8"));
try {
    out.write(str);
} finally {
    out.close();
}

A CharBuffer version would need more rewriting, also touching the pattern matching.
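
If the byte positions are still needed (for example, to slice the original ByteBuffer), one possible approach is to re-encode the text in front of each String index and count the resulting bytes. The following is only a minimal sketch: the names ByteOffsets and byteOffset are illustrative and not part of the original code, and it assumes the text was decoded without any replacement characters, so that encoding it again reproduces the original bytes.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ByteOffsets {
    // Hypothetical helper: number of bytes occupied by the first
    // charIndex chars of text when encoded with the given charset.
    static int byteOffset(String text, int charIndex, Charset charset) {
        return text.substring(0, charIndex).getBytes(charset).length;
    }

    public static void main(String[] args) {
        // U+3042 (HIRAGANA LETTER A) is one char but three bytes in UTF-8.
        String text = "<a>\u3042</a>";
        int charEnd = text.length();
        System.out.println(charEnd);                                          // 8
        System.out.println(byteOffset(text, charEnd, StandardCharsets.UTF_8)); // 10
    }
}
```

With offsets computed this way, the buffer limit would be set to the end byte offset itself rather than to a difference of indices, since ByteBuffer.limit is an absolute position.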

Upvotes: 1

Christian Trimble

Reputation: 2136

To properly transcode this file, you should use Java's XML APIs. Although there are several ways to do this, here is a solution using the javax.xml.transform package. To start with, we really need the djnml-1.0b.dtd file referenced in your document (in case it declares entities that the document references). Since it is missing, this solution uses a DTD generated from the supplied input, using Trang:

<?xml encoding="UTF-8"?>

<!ELEMENT doc (djnml)>
<!ATTLIST doc
  xmlns CDATA #FIXED ''
  destination NMTOKEN #REQUIRED
  distId NMTOKEN #REQUIRED
  md5 CDATA #REQUIRED
  msize CDATA #REQUIRED
  sysId NMTOKEN #REQUIRED
  transmission-date NMTOKEN #REQUIRED>

<!ELEMENT djnml (head,body)>
<!ATTLIST djnml
  xmlns CDATA #FIXED ''
  docdate CDATA #REQUIRED
  product NMTOKEN #REQUIRED
  publisher NMTOKEN #REQUIRED
  seq CDATA #REQUIRED
  xml:lang NMTOKEN #REQUIRED>

<!ELEMENT head (copyright,docdata)>
<!ATTLIST head
  xmlns CDATA #FIXED ''>

<!ELEMENT body (headline,text)>
<!ATTLIST body
  xmlns CDATA #FIXED ''>

<!ELEMENT copyright EMPTY>
<!ATTLIST copyright
  xmlns CDATA #FIXED ''
  holder CDATA #REQUIRED
  year CDATA #REQUIRED>

<!ELEMENT docdata (djn)>
<!ATTLIST docdata
  xmlns CDATA #FIXED ''>

<!ELEMENT headline (#PCDATA)>
<!ATTLIST headline
  xmlns CDATA #FIXED ''
  brand-display NMTOKEN #REQUIRED
  prefix CDATA #REQUIRED>

<!ELEMENT text (pre,p+)>
<!ATTLIST text
  xmlns CDATA #FIXED ''>

<!ELEMENT djn (djn-newswires)>
<!ATTLIST djn
  xmlns CDATA #FIXED ''>

<!ELEMENT pre EMPTY>
<!ATTLIST pre
  xmlns CDATA #FIXED ''>

<!ELEMENT p (#PCDATA)>
<!ATTLIST p
  xmlns CDATA #FIXED ''>

<!ELEMENT djn-newswires (djn-press-cutout,djn-urgency,djn-mdata)>
<!ATTLIST djn-newswires
  xmlns CDATA #FIXED ''
  news-source NMTOKEN #REQUIRED
  origin NMTOKEN #REQUIRED
  service-id NMTOKEN #REQUIRED>

<!ELEMENT djn-press-cutout EMPTY>
<!ATTLIST djn-press-cutout
  xmlns CDATA #FIXED ''>

<!ELEMENT djn-urgency (#PCDATA)>
<!ATTLIST djn-urgency
  xmlns CDATA #FIXED ''>

<!ELEMENT djn-mdata (djn-coding)>
<!ATTLIST djn-mdata
  xmlns CDATA #FIXED ''
  accession-number CDATA #REQUIRED
  brand NMTOKEN #REQUIRED
  display-date NMTOKEN #REQUIRED
  hot NMTOKEN #REQUIRED
  original-source NMTOKEN #REQUIRED
  page-citation CDATA #REQUIRED
  retention NMTOKEN #REQUIRED
  temp-perm NMTOKEN #REQUIRED>

<!ELEMENT djn-coding (djn-company,djn-isin,djn-industry,djn-subject,
                      djn-market,djn-product,djn-geo)>
<!ATTLIST djn-coding
  xmlns CDATA #FIXED ''>

<!ELEMENT djn-company (c)>
<!ATTLIST djn-company
  xmlns CDATA #FIXED ''>

<!ELEMENT djn-isin (c)>
<!ATTLIST djn-isin
  xmlns CDATA #FIXED ''>

<!ELEMENT djn-industry (c)+>
<!ATTLIST djn-industry
  xmlns CDATA #FIXED ''>

<!ELEMENT djn-subject (c)+>
<!ATTLIST djn-subject
  xmlns CDATA #FIXED ''>

<!ELEMENT djn-market (c)+>
<!ATTLIST djn-market
  xmlns CDATA #FIXED ''>

<!ELEMENT djn-product (c)+>
<!ATTLIST djn-product
  xmlns CDATA #FIXED ''>

<!ELEMENT djn-geo (c)+>
<!ATTLIST djn-geo
  xmlns CDATA #FIXED ''>

<!ELEMENT c (#PCDATA)>
<!ATTLIST c
  xmlns CDATA #FIXED ''>

After you write this file out to "djnml-1.0b.dtd", we need to create an identity transform using XSLT. You could do this with the newTransformer() method on TransformerFactory, but the results of this transform are not well specified. Using XSLT will produce cleaner results. We will use this file as our identity transform:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes" omit-xml-declaration="no"/>

  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>

Save the above XSLT file as "identity.xsl". Now that we have our DTD and our identity transform, we can transcode the file using this code:

import java.io.Closeable;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.ArrayList;
import java.util.List;

import javax.xml.transform.Templates;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLReaderFactory;

...

File inFile = new File("20121114000606JA.xml");
File outputFile = new File("test.xml");
final File dtdFile = new File("djnml-1.0b.dtd");
File identityFile = new File("identity.xsl");

final List<Closeable> closeables = new ArrayList<Closeable>();
try {
  // We are going to use a SAXSource for input, so that we can specify the
  // location of the DTD with an EntityResolver.
  InputStream in = new FileInputStream(inFile);
  closeables.add(in);
  InputSource fileSource = new InputSource();
  fileSource.setByteStream(in);
  fileSource.setSystemId(inFile.toURI().toString());

  SAXSource source = new SAXSource();
  XMLReader reader = XMLReaderFactory.createXMLReader();
  reader.setEntityResolver(new EntityResolver() {
    public InputSource resolveEntity(String publicId, String systemId)
        throws SAXException, IOException {
      if (systemId != null && systemId.endsWith("/djnml-1.0b.dtd")) {
        InputStream dtdIn = new FileInputStream(dtdFile);
        closeables.add(dtdIn);

        InputSource inputSource = new InputSource();
        inputSource.setByteStream(dtdIn);
        inputSource.setEncoding("UTF-8");

        return inputSource;
      }
      return null;
    }
  });

  source.setXMLReader(reader);
  source.setInputSource(fileSource);

  // Now we need to create a StreamResult.
  OutputStream out = new FileOutputStream(outputFile);
  closeables.add(out);
  StreamResult result = new StreamResult();
  result.setOutputStream(out);
  result.setSystemId(outputFile);

  // Create a templates object for the identity transform.  If you are going
  // to transform a lot of documents, you should do this once and
  // reuse the Templates object.
  InputStream identityIn = new FileInputStream(identityFile);
  closeables.add(identityIn);
  StreamSource identitySource = new StreamSource();
  identitySource.setSystemId(identityFile);
  identitySource.setInputStream(identityIn);
  TransformerFactory factory = TransformerFactory.newInstance();
  Templates templates = factory.newTemplates(identitySource);

  // Finally we need to create the transformer and do the transformation.
  Transformer transformer = templates.newTransformer();
  transformer.transform(source, result);

} finally {
  // Some older XML processors are bad at cleaning up input and output streams,
  // so we will do this manually.
  for (Closeable closeable : closeables) {
    if (closeable != null) {
      try {
        closeable.close();
      } catch (Exception e) {
        // Best-effort cleanup: ignore failures while closing.
      }
    }
  }
}
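
For comparison, the stylesheet-free identity transformer mentioned earlier (the no-argument newTransformer() method on TransformerFactory) can be sketched as follows. This is an illustration only: the class name IdentityCopy and the method copy are made up for the example, it works on an in-memory String rather than on the files above, and details of the serialized output such as the XML declaration depend on the XML processor in use.

```java
import java.io.StringReader;
import java.io.StringWriter;

import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class IdentityCopy {
    // Hypothetical helper: copies an XML document through the default
    // identity transformer and returns the serialized result.
    static String copy(String xml) {
        try {
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)),
                                  new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("Identity transform failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(copy("<doc><p>\u3042</p></doc>"));
    }
}
```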

Upvotes: 0

Christian Trimble

Reputation: 2136

Your problem here seems to be with your decoding of the byte buffer. You are decoding a Shift-JIS ByteBuffer with the UTF-8 Charset. You need to change that to the Shift_JIS Charset. Java's documentation lists the supported character encodings.

Although I do not have a Shift-JIS file to test with, you should try changing the Charset.forName line to:

Charset data_charset = Charset.forName("Shift_JIS");
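
To illustrate why the Charset matters, the sketch below decodes the same two bytes both ways. It is not taken from the question: the class name DecodeDemo and the method decode are made up, the bytes 0x82 0xA0 are the Shift_JIS encoding of U+3042 (hiragana "a"), and it assumes the runtime provides the Shift_JIS charset, which may not hold for trimmed-down Java runtimes.

```java
import java.nio.ByteBuffer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DecodeDemo {
    // Hypothetical helper: decodes bytes with the given charset;
    // Charset.decode replaces malformed input with U+FFFD.
    static String decode(byte[] bytes, Charset charset) {
        return charset.decode(ByteBuffer.wrap(bytes)).toString();
    }

    public static void main(String[] args) {
        byte[] sjis = {(byte) 0x82, (byte) 0xA0};

        String right = decode(sjis, Charset.forName("Shift_JIS"));
        String wrong = decode(sjis, StandardCharsets.UTF_8);

        System.out.println(right.equals("\u3042"));       // true
        System.out.println(wrong.indexOf('\uFFFD') >= 0); // true
    }
}
```

Because the wrongly decoded String contains replacement characters instead of the real text, neither its content nor its indices can be trusted for locating positions in the original bytes.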

Also, your regex logic is a little off. I would not use a second matcher, since this causes the search to start over and could lead to a reversed range. Instead, try getting the position of the current match and then changing the Pattern that your matcher is using:

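Matcher matcher = startPattern.matcher(request);
if (matcher.find()) {
  int i0 = matcher.start();
  // usePattern keeps the matcher's position, so the next find()
  // continues after the start match instead of starting over.
  matcher.usePattern(endPattern);

  if (matcher.find()) {
    int i1 = matcher.end();
    String str = request.substring(i0, i1);
  }
}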

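Shift-JIS is a variable-width encoding (one or two bytes per character), and once it is decoded with the correct Charset its characters should map cleanly onto Java's UTF-16 chars. This should allow you to match this with a single pattern like "START.*END" (compiled with Pattern.DOTALL so that . also matches line breaks) and just use groups to get your data.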
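
A single-pattern version could look like the sketch below. The class name SingleDocPattern and the method extract are invented for the example. It reuses the start and end expressions from the question, adds Pattern.DOTALL because the document spans several lines, and uses a reluctant quantifier so that the match stops at the first closing tag.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SingleDocPattern {
    // Group 1 captures everything from the XML declaration up to and
    // including the first "</doc>" followed by a newline.
    static final Pattern DOC =
        Pattern.compile("(<\\?xml .*?</doc>\\n)", Pattern.DOTALL);

    // Hypothetical helper: returns the captured document, or null
    // when the input contains no complete document.
    static String extract(String request) {
        Matcher matcher = DOC.matcher(request);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        String request = "junk<?xml version=\"1.0\"?>\n<doc>\u3042</doc>\ntrailer";
        System.out.println(extract(request));
    }
}
```

The extracted String can then be written with a Writer that uses the desired output encoding, which avoids mapping char indices back to byte positions.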

Upvotes: 0
