yottamoto
yottamoto

Reputation: 391

Accessing BLOBS in an Oracle10g Database created with MS Access

I am writing a Java (ok, Groovy, but that doesn't matter here) migration script to copy BLOB fields from an Oracle10g database to another. The data was created by an MS Access application. The files seem to have an incorrect encoding and I am guessing that MS Access or the ODBC driver manipulate the file in some way.

Using the query SELECT * FROM NLS_DATABASE_PARAMETERS WHERE PARAMETER = 'NLS_CHARACTERSET'; I found out that the source database has the characterset WE8MSWIN1252.

The source table is defined as:

CREATE TABLE CTR_DOCUMENTS (
  CTR_ID        NUMBER(11)           NOT NULL,
  CTR_A_ID      NUMBER(11),
  CTR_FILENAME  VARCHAR2(260 Char)   NOT NULL,
  CTR_COMMENT   VARCHAR2(255 Char),
  CTR_DATE      DATE,
  CTR_DATA      BLOB
)

I access the blobs like so:

def blob = sourceDB.firstRow("SELECT CTR_DATA FROM CTR_DOCUMENTS WHERE CTR_ID = ?",
  [id]).CTR_DATA
def blobSize = blob.length()
def blobStream = blob.getBinaryStream()
byte[] byteArray = new byte[blobSize]
blobStream.read(byteArray)

I saved some of the blobs as files, and the encoding looks wierd and the files cannot be opened by their programs. The second byte is always 00:

0000000: 2500 5000 4400 4600 2d00 3100 2e00 3500  %.P.D.F.-.1...5.

I also observe the same behavior accessing the BLOBS with an SQL Client (SQL Workbench/J, SQLDeveloper, TOAD).

For me, it looks like I have to convert the file from Windows-1252 to UTF8, but that does not work. Am I missing something here?

Upvotes: 2

Views: 343

Answers (1)

Gord Thompson
Gord Thompson

Reputation: 123419

Where I would start, for diagnostic purposes if nothing else, would be to scan the byte arrays from a sample of input BLOBs to see if every second byte was in fact 0x00, and writing every other (non-zero) byte to a bytesOut byte array. If successful, I would write the bytesOut arrays to files and see if they are now valid PDF documents. For example:

public static void main(String[] args) {
   try {
        String connectionUrl = ""; 
        connectionUrl = 
                "jdbc:sqlserver://localhost;" +
                "instanceName=SQLEXPRESS;" +
                "databaseName=myDb;" +
                "integratedSecurity=true";
        Connection con = DriverManager.getConnection(connectionUrl);

        String SQL = 
                "SELECT CTR_ID, CTR_FILENAME, CTR_DATA " +
                "FROM CTR_DOCUMENTS " +
                "WHERE CTR_ID BETWEEN 1 AND 5"; 
        Statement stmt = con.createStatement(); 
        ResultSet rs = stmt.executeQuery(SQL);

        while (rs.next()) { 
            boolean writeFile = true; 
            byte[] bytesIn = rs.getBytes("CTR_DATA");
            //scan input byte array and copy every second byte to output byte array
            byte[] bytesOut = new byte[bytesIn.length / 2];
            for (int i = 0; i < bytesIn.length; i++) {
                if ((i % 2) == 1) {
                    if (bytesIn[i] != 0x00) {
                        System.out.println(String.format("Darn. bytesIn value at offset %d is not 0x00. Skipping...", i));
                        writeFile = false;
                        break;
                    }
                }
                else {
                    bytesOut[i / 2] = bytesIn[i];
                }
            }
            if (writeFile) {
                String outFile =
                        "C:\\__tmp\\pdfTest\\" + rs.getString("CTR_FILENAME");
                FileOutputStream fos = new FileOutputStream(outFile);
                fos.write(bytesOut);
                fos.close();
                System.out.println(String.format("\"%s\" created.", outFile));
            }
        }
        rs.close();
        con.close();
   } catch(Exception e) {
        System.out.println(e.getMessage());
        System.exit(0); 
   }
}

The reasoning is that if, somewhere along the way, some process took what it thought was a "string" of single-byte characters (e.g., Windows-1252) and converted to Unicode (e.g., UCS-2LE) by simply inserting 0x00 after every character (without otherwise mangling the actual data bytes), then the most straightforward solution would be to just take those 0x00 bytes out again.

Upvotes: 2

Related Questions