Reputation: 391
I am writing a Java (ok, Groovy, but that doesn't matter here) migration script to copy BLOB fields from one Oracle 10g database to another. The data was created by an MS Access application. The files seem to have an incorrect encoding, and I am guessing that MS Access or the ODBC driver manipulates the files in some way.
Using the query
SELECT * FROM NLS_DATABASE_PARAMETERS WHERE PARAMETER = 'NLS_CHARACTERSET';
I found out that the source database has the character set WE8MSWIN1252.
The source table is defined as:
CREATE TABLE CTR_DOCUMENTS (
CTR_ID NUMBER(11) NOT NULL,
CTR_A_ID NUMBER(11),
CTR_FILENAME VARCHAR2(260 Char) NOT NULL,
CTR_COMMENT VARCHAR2(255 Char),
CTR_DATE DATE,
CTR_DATA BLOB
)
I access the blobs like so:
def blob = sourceDB.firstRow("SELECT CTR_DATA FROM CTR_DOCUMENTS WHERE CTR_ID = ?",
[id]).CTR_DATA
def blobSize = blob.length() as int
def blobStream = blob.getBinaryStream()
byte[] byteArray = new byte[blobSize]
// InputStream.read() may return fewer bytes than requested,
// so keep reading until the whole BLOB is in the buffer
int offset = 0
while (offset < blobSize) {
    offset += blobStream.read(byteArray, offset, blobSize - offset)
}
I saved some of the blobs as files; the content looks weird and the files cannot be opened by their associated programs. The second byte is always 00:
0000000: 2500 5000 4400 4600 2d00 3100 2e00 3500 %.P.D.F.-.1...5.
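Here is a minimal Java sketch of such a dump (the class and method names are just for illustration); it prints a byte array in the same xxd-style layout and reproduces the line above for the sample bytes:
public class BlobDump {
    // Print the first (up to) 16 bytes of data in xxd style:
    // offset, hex digits grouped in 2-byte pairs, then an ASCII column
    static void dumpFirstLine(byte[] data) {
        StringBuilder hex = new StringBuilder();
        StringBuilder ascii = new StringBuilder();
        int n = Math.min(16, data.length);
        for (int i = 0; i < n; i++) {
            hex.append(String.format("%02x", data[i]));
            if (i % 2 == 1) {
                hex.append(' ');
            }
            ascii.append(data[i] >= 0x20 && data[i] < 0x7f ? (char) data[i] : '.');
        }
        System.out.printf("%07x: %-40s %s%n", 0, hex, ascii);
    }

    public static void main(String[] args) {
        // the first 16 bytes observed in the saved files
        byte[] sample = {0x25, 0x00, 0x50, 0x00, 0x44, 0x00, 0x46, 0x00,
                         0x2d, 0x00, 0x31, 0x00, 0x2e, 0x00, 0x35, 0x00};
        dumpFirstLine(sample);  // prints the line shown above
    }
}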
I also observe the same behavior when accessing the BLOBs with an SQL client (SQL Workbench/J, SQL Developer, TOAD).
To me, it looks like I have to convert the files from Windows-1252 to UTF-8, but that does not work (roughly as sketched below). Am I missing something here?
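A sketch of that re-encoding attempt (the names are illustrative; byteArray is the array read above):
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ReencodeAttempt {
    // Decode the raw bytes as Windows-1252 and re-encode them as UTF-8.
    // For a binary format like PDF this cannot work: the bytes are not
    // text, so the round-trip mangles them, and it does not remove the
    // interleaved 0x00 bytes either.
    static byte[] reencode(byte[] byteArray) {
        return new String(byteArray, Charset.forName("windows-1252"))
                .getBytes(StandardCharsets.UTF_8);
    }
}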
Upvotes: 2
Views: 343
Reputation: 123419
Where I would start, for diagnostic purposes if nothing else, would be to scan the byte arrays from a sample of input BLOBs to see if every second byte is in fact 0x00, and if so write every other (non-zero) byte to a bytesOut byte array. If successful, I would write the bytesOut arrays to files and see if they are now valid PDF documents. For example:
import java.io.FileOutputStream;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class BlobFixTest {

    public static void main(String[] args) {
        try {
            // example uses a SQL Server connection; adjust the JDBC URL
            // (and driver) for your Oracle source database
            String connectionUrl =
                    "jdbc:sqlserver://localhost;" +
                    "instanceName=SQLEXPRESS;" +
                    "databaseName=myDb;" +
                    "integratedSecurity=true";
            Connection con = DriverManager.getConnection(connectionUrl);
            String SQL =
                    "SELECT CTR_ID, CTR_FILENAME, CTR_DATA " +
                    "FROM CTR_DOCUMENTS " +
                    "WHERE CTR_ID BETWEEN 1 AND 5";
            Statement stmt = con.createStatement();
            ResultSet rs = stmt.executeQuery(SQL);
            while (rs.next()) {
                boolean writeFile = true;
                byte[] bytesIn = rs.getBytes("CTR_DATA");
                // scan the input byte array and copy every second byte to the
                // output byte array; (length + 1) / 2 also covers an odd-length input
                byte[] bytesOut = new byte[(bytesIn.length + 1) / 2];
                for (int i = 0; i < bytesIn.length; i++) {
                    if ((i % 2) == 1) {
                        // every odd offset must be 0x00 for this theory to hold
                        if (bytesIn[i] != 0x00) {
                            System.out.println(String.format(
                                    "Darn. bytesIn value at offset %d is not 0x00. Skipping...", i));
                            writeFile = false;
                            break;
                        }
                    } else {
                        bytesOut[i / 2] = bytesIn[i];
                    }
                }
                if (writeFile) {
                    String outFile =
                            "C:\\__tmp\\pdfTest\\" + rs.getString("CTR_FILENAME");
                    FileOutputStream fos = new FileOutputStream(outFile);
                    fos.write(bytesOut);
                    fos.close();
                    System.out.println(String.format("\"%s\" created.", outFile));
                }
            }
            rs.close();
            con.close();
        } catch (Exception e) {
            System.out.println(e.getMessage());
            System.exit(0);
        }
    }
}
The reasoning is that if, somewhere along the way, some process took what it thought was a "string" of single-byte characters (e.g., Windows-1252) and converted it to Unicode (e.g., UCS-2LE) by simply inserting 0x00 after every character (without otherwise mangling the actual data bytes), then the most straightforward solution would be to take those 0x00 bytes out again.
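As a quick sanity check of that theory, encoding the expected PDF header as UTF-16LE (which coincides with UCS-2LE for these characters) reproduces exactly the byte pattern from the question's hex dump:
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class HeaderCheck {
    public static void main(String[] args) {
        // "%PDF-1.5" encoded as UTF-16LE: each ASCII byte followed by 0x00
        byte[] encoded = "%PDF-1.5".getBytes(StandardCharsets.UTF_16LE);
        byte[] expected = {0x25, 0x00, 0x50, 0x00, 0x44, 0x00, 0x46, 0x00,
                           0x2d, 0x00, 0x31, 0x00, 0x2e, 0x00, 0x35, 0x00};
        System.out.println(Arrays.equals(encoded, expected)); // prints: true
    }
}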
Upvotes: 2