Reputation: 2135
I am reading a Shift-JIS encoded XML file into a ByteBuffer, converting it to a String, and using Pattern and Matcher to find the start and the end of the document. From these two positions I try to write that part of the buffer to a file. It works when there are no multibyte characters. If there is a multibyte character, I lose some text at the end, since the value of end is slightly off.
static final Pattern startPattern = Pattern.compile("<\\?xml ");
static final Pattern endPattern = Pattern.compile("</doc>\n");
public static void main(String[] args) throws Exception {
File f = new File("20121114000606JA.xml");
FileInputStream fis = new FileInputStream(f);
FileChannel fci = fis.getChannel();
ByteBuffer data_buffer = ByteBuffer.allocate(65536);
while (true) {
int read = fci.read(data_buffer);
if (read == -1)
break;
}
ByteBuffer cbytes = data_buffer.duplicate();
cbytes.flip();
Charset data_charset = Charset.forName("UTF-8");
String request = data_charset.decode(cbytes).toString();
Matcher start = startPattern.matcher(request);
if (start.find()) {
Matcher end = endPattern.matcher(request);
if (end.find()) {
int i0 = start.start();
int i1 = end.end();
String str = request.substring(i0, i1);
String filename = "test.xml";
FileChannel fc = new FileOutputStream(new File(filename), false).getChannel();
data_buffer.position(i0);
data_buffer.limit(i1 - i0);
long offset = fc.position();
long sz = fc.write(data_buffer);
fc.close();
}
}
System.out.println("OK");
}
Upvotes: 0
Views: 1181
Reputation: 109547
Using the String indices i0 and i1 for byte positions in:
data_buffer.position(i0);
data_buffer.limit(i1 - i0);
is erroneous: i0 and i1 are char indices into the decoded String, not byte offsets into the buffer, and the two diverge as soon as a character takes more than one byte. In addition, Unicode does not give text a unique encoding (ĉ can also be written as the two characters c + combining diacritical mark ^), so translating back and forth between chars and bytes is not only expensive but error prone (in rare cases of specific data).
Write the matched String through a Writer, or use a CharBuffer, which implements CharSequence.
Instead of writing to the FileChannel fc:
BufferedWriter out = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(
new File(filename)), "UTF-8"));
try {
out.write(str);
} finally {
out.close();
}
A CharBuffer version would need more rewriting, also touching the pattern matching.
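To illustrate why the String indices cannot be reused as buffer positions, here is a minimal sketch that converts a char index into a byte offset by encoding the prefix. The class and the byteOffset helper are hypothetical names, not part of the question's code. The demo uses UTF-8 because every Java runtime ships it; the same helper should work with Shift_JIS where that charset is installed.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ByteOffsetDemo {
    // Hypothetical helper: byte offset of the char at charIndex once text is
    // encoded with cs. Assumes a stateless charset (UTF-8, Shift_JIS) and that
    // charIndex does not fall in the middle of a surrogate pair.
    static int byteOffset(String text, int charIndex, Charset cs) {
        return text.substring(0, charIndex).getBytes(cs).length;
    }

    public static void main(String[] args) {
        // Three kanji followed by "</doc>", written with Unicode escapes so the
        // source compiles regardless of the file encoding.
        String text = "\u65E5\u672C\u8A9E</doc>";
        int charIndex = text.indexOf("</doc>");
        int byteIndex = byteOffset(text, charIndex, StandardCharsets.UTF_8);
        // prints: char index = 3, byte offset = 9
        System.out.println("char index = " + charIndex + ", byte offset = " + byteIndex);
    }
}
```

Encoding the prefix costs an extra pass over the data, which is one more reason to write the matched String through a Writer instead of going back to the byte buffer.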
Upvotes: 1
Reputation: 2136
To properly transcode this file, you should use Java's XML APIs. Although there are several ways to do this, here is a solution using the javax.xml.transform package. To start with, we really need the djnml-1.0b.dtd file referenced in your document (in case it contains entity references). Since it is missing, this solution uses a DTD generated from the supplied input using Trang:
<?xml encoding="UTF-8"?>
<!ELEMENT doc (djnml)>
<!ATTLIST doc
xmlns CDATA #FIXED ''
destination NMTOKEN #REQUIRED
distId NMTOKEN #REQUIRED
md5 CDATA #REQUIRED
msize CDATA #REQUIRED
sysId NMTOKEN #REQUIRED
transmission-date NMTOKEN #REQUIRED>
<!ELEMENT djnml (head,body)>
<!ATTLIST djnml
xmlns CDATA #FIXED ''
docdate CDATA #REQUIRED
product NMTOKEN #REQUIRED
publisher NMTOKEN #REQUIRED
seq CDATA #REQUIRED
xml:lang NMTOKEN #REQUIRED>
<!ELEMENT head (copyright,docdata)>
<!ATTLIST head
xmlns CDATA #FIXED ''>
<!ELEMENT body (headline,text)>
<!ATTLIST body
xmlns CDATA #FIXED ''>
<!ELEMENT copyright EMPTY>
<!ATTLIST copyright
xmlns CDATA #FIXED ''
holder CDATA #REQUIRED
year CDATA #REQUIRED>
<!ELEMENT docdata (djn)>
<!ATTLIST docdata
xmlns CDATA #FIXED ''>
<!ELEMENT headline (#PCDATA)>
<!ATTLIST headline
xmlns CDATA #FIXED ''
brand-display NMTOKEN #REQUIRED
prefix CDATA #REQUIRED>
<!ELEMENT text (pre,p+)>
<!ATTLIST text
xmlns CDATA #FIXED ''>
<!ELEMENT djn (djn-newswires)>
<!ATTLIST djn
xmlns CDATA #FIXED ''>
<!ELEMENT pre EMPTY>
<!ATTLIST pre
xmlns CDATA #FIXED ''>
<!ELEMENT p (#PCDATA)>
<!ATTLIST p
xmlns CDATA #FIXED ''>
<!ELEMENT djn-newswires (djn-press-cutout,djn-urgency,djn-mdata)>
<!ATTLIST djn-newswires
xmlns CDATA #FIXED ''
news-source NMTOKEN #REQUIRED
origin NMTOKEN #REQUIRED
service-id NMTOKEN #REQUIRED>
<!ELEMENT djn-press-cutout EMPTY>
<!ATTLIST djn-press-cutout
xmlns CDATA #FIXED ''>
<!ELEMENT djn-urgency (#PCDATA)>
<!ATTLIST djn-urgency
xmlns CDATA #FIXED ''>
<!ELEMENT djn-mdata (djn-coding)>
<!ATTLIST djn-mdata
xmlns CDATA #FIXED ''
accession-number CDATA #REQUIRED
brand NMTOKEN #REQUIRED
display-date NMTOKEN #REQUIRED
hot NMTOKEN #REQUIRED
original-source NMTOKEN #REQUIRED
page-citation CDATA #REQUIRED
retention NMTOKEN #REQUIRED
temp-perm NMTOKEN #REQUIRED>
<!ELEMENT djn-coding (djn-company,djn-isin,djn-industry,djn-subject,
djn-market,djn-product,djn-geo)>
<!ATTLIST djn-coding
xmlns CDATA #FIXED ''>
<!ELEMENT djn-company (c)>
<!ATTLIST djn-company
xmlns CDATA #FIXED ''>
<!ELEMENT djn-isin (c)>
<!ATTLIST djn-isin
xmlns CDATA #FIXED ''>
<!ELEMENT djn-industry (c)+>
<!ATTLIST djn-industry
xmlns CDATA #FIXED ''>
<!ELEMENT djn-subject (c)+>
<!ATTLIST djn-subject
xmlns CDATA #FIXED ''>
<!ELEMENT djn-market (c)+>
<!ATTLIST djn-market
xmlns CDATA #FIXED ''>
<!ELEMENT djn-product (c)+>
<!ATTLIST djn-product
xmlns CDATA #FIXED ''>
<!ELEMENT djn-geo (c)+>
<!ATTLIST djn-geo
xmlns CDATA #FIXED ''>
<!ELEMENT c (#PCDATA)>
<!ATTLIST c
xmlns CDATA #FIXED ''>
After you write this file out to "djnml-1.0b.dtd", we need to create an identity transform using XSLT. You could do this with the newTransformer() method on TransformerFactory, but the results of this transform are not well specified. Using XSLT will produce cleaner results. We will use this file as our identity transform:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes" omit-xml-declaration="no"/>
<xsl:template match="@*|node()">
<xsl:copy>
<xsl:apply-templates select="@*|node()"/>
</xsl:copy>
</xsl:template>
</xsl:stylesheet>
Save the above XSLT file as "identity.xsl". Now that we have our DTD and our identity transform, we can transcode the file using this code:
import java.io.Closeable;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.ArrayList;
import java.util.List;
import javax.xml.transform.Templates;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLReaderFactory;
...
File inFile = new File("20121114000606JA.xml");
File outputFile = new File("test.xml");
final File dtdFile = new File("djnml-1.0b.dtd");
File identityFile = new File("identity.xsl");
final List<Closeable> closeables = new ArrayList<Closeable>();
try {
// We are going to use a SAXSource for input, so that we can specify the
// location of the DTD with an EntityResolver.
InputStream in = new FileInputStream(inFile);
closeables.add(in);
InputSource fileSource = new InputSource();
fileSource.setByteStream(in);
fileSource.setSystemId(inFile.toURI().toString());
SAXSource source = new SAXSource();
XMLReader reader = XMLReaderFactory.createXMLReader();
reader.setEntityResolver(new EntityResolver() {
public InputSource resolveEntity(String publicId, String systemId)
throws SAXException, IOException {
if (systemId != null && systemId.endsWith("/djnml-1.0b.dtd")) {
InputStream dtdIn = new FileInputStream(dtdFile);
closeables.add(dtdIn);
InputSource inputSource = new InputSource();
inputSource.setByteStream(dtdIn);
inputSource.setEncoding("UTF-8");
return inputSource;
}
return null;
}
});
source.setXMLReader(reader);
source.setInputSource(fileSource);
// Now we need to create a StreamResult.
OutputStream out = new FileOutputStream(outputFile);
closeables.add(out);
StreamResult result = new StreamResult();
result.setOutputStream(out);
result.setSystemId(outputFile);
// Create a templates object for the identity transform. If you are going
// to transform a lot of documents, you should do this once and
// reuse the Templates object.
InputStream identityIn = new FileInputStream(identityFile);
closeables.add(identityIn);
StreamSource identitySource = new StreamSource();
identitySource.setSystemId(identityFile);
identitySource.setInputStream(identityIn);
TransformerFactory factory = TransformerFactory.newInstance();
Templates templates = factory.newTemplates(identitySource);
// Finally we need to create the transformer and do the transformation.
Transformer transformer = templates.newTransformer();
transformer.transform(source, result);
} finally {
// Some older XML processors are bad at cleaning up input and output streams,
// so we will do this manually.
for (Closeable closeable : closeables) {
if (closeable != null) {
try {
closeable.close();
} catch (Exception e) {
}
}
}
}
Upvotes: 0
Reputation: 2136
Your problem here seems to be with the decoding of the byte buffer. You are decoding a Shift-JIS ByteBuffer with a UTF-8 Charset. You need to change that to the Shift-JIS Charset. The JDK documentation lists the supported character encodings.
Although I do not have a Shift-JIS file to test with, you should try changing the Charset.forName line to:
Charset data_charset = Charset.forName("Shift_JIS");
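As a rough illustration of what the mismatch does, the sketch below encodes a short Japanese string as Shift_JIS and decodes it once with the right charset and once as UTF-8. The class and method names are made up for this example, and it assumes the runtime provides Shift_JIS (the check at the top covers runtimes that do not).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetMismatchDemo {
    // Encodes text with 'actual', decodes the bytes with 'assumed', and reports
    // whether the original text survives the round trip.
    static boolean survives(String text, Charset actual, Charset assumed) {
        byte[] bytes = text.getBytes(actual);
        return new String(bytes, assumed).equals(text);
    }

    public static void main(String[] args) {
        if (!Charset.isSupported("Shift_JIS")) {
            System.out.println("Shift_JIS is not available in this runtime");
            return;
        }
        Charset sjis = Charset.forName("Shift_JIS");
        // Three kanji inside a <p> element, written with Unicode escapes.
        String text = "<p>\u65E5\u672C\u8A9E</p>";
        System.out.println("decoded as Shift_JIS: " + survives(text, sjis, sjis));
        System.out.println("decoded as UTF-8:     " + survives(text, sjis, StandardCharsets.UTF_8));
    }
}
```

When the bytes are decoded as UTF-8, the malformed sequences are replaced, so the resulting String no longer has the same characters or the same length as the real text.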
Also, your regex logic is a little off. I would not use a second matcher, since that makes the search start over from the beginning and could lead to a reversed range. Instead, get the position of the current match and then change the Pattern that your matcher is using:
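Matcher matcher = startPattern.matcher(request);
if (matcher.find()) {
    int i0 = matcher.start();
    // usePattern keeps the matcher's position, so the next find() resumes after the start match.
    matcher.usePattern(endPattern);
    if (matcher.find()) {
        int i1 = matcher.end();
        String str = request.substring(i0, i1);
    }
}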
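Shift-JIS uses one or two bytes per character, and once it is decoded with the right Charset it should map cleanly onto Java's UTF-16 chars. This should allow you to match with a single pattern like "START.*END" (compiled with Pattern.DOTALL, so that . also matches line breaks) and just use groups to get your data.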
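A minimal sketch of the single-pattern idea, reusing the start and end markers from the question. The class name, the extract helper, and the sample input are assumptions made for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SinglePatternDemo {
    // DOTALL lets '.' match line breaks; the reluctant '.*?' stops at the
    // first "</doc>\n" instead of the last one.
    static final Pattern DOC = Pattern.compile("(<\\?xml .*?</doc>\\n)", Pattern.DOTALL);

    // Returns the first complete document found in request, or null if none.
    static String extract(String request) {
        Matcher m = DOC.matcher(request);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Sample input with leading and trailing data around one document.
        String request = "junk<?xml version=\"1.0\"?>\n<doc>\u65E5\u672C\u8A9E</doc>\ntrailer";
        System.out.println(extract(request));
    }
}
```

The extracted String can then be written out through a Writer with the desired output encoding, which avoids going back to byte positions at all.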
Upvotes: 0