CuriP
CuriP

Reputation: 83

How to retain newline character while converting the printstream from XSSF XLSX2CSV class to ByteInputStream?

I am using Java SAX parser to read data from Excel (using XSSF XLSX2CSV class) and load it in to Greenplum database. I am using the code from the following link:

http://svn.apache.org/repos/asf/poi/trunk/src/examples/src/org/apache/poi/xssf/eventusermodel/XLSX2CSV.java

I capture the PrintStream output from the above code, convert it on ByteInputStream, and then load it in to Postgres(Greenplum) using native bulk load utility - copy - command.

I modified the following in the main method of XLSX2CSV to capture the printstream and convert it in to byte input steam.

ByteArrayOutputStream baos = new ByteArrayOutputStream();
PrintStream ps = new PrintStream(baos, true, "UTF-8");

// The package open is instantaneous, as it should be.
try (OPCPackage p = OPCPackage.open(xlsxFile.getPath(), PackageAccess.READ)) {
XLSX2CSV xlsx2csv = new XLSX2CSV(p, ps, minColumns);
xlsx2csv.process();
System.out.println(ps);
String data = new String(baos.toByteArray(), StandardCharsets.UTF_8);
System.out.println(data);
byte[] bytes = data.getBytes("UTF8");
ByteArrayInputStream orinput = new ByteArrayInputStream(bytes);
String dbURL1 = "jdbc:postgresql://xxxxx:xxxxx/xxxxx";
String user = "xxxxxx";
String pass = "xxxxxx";
Connection GPConnection = DriverManager.getConnection(dbURL1, user, pass);

 Statement  GPsqlStatement = GPConnection.createStatement();
 String GPStgTableTrunc = "truncate test_table";
 GPsqlStatement.execute(GPStgTableTrunc);
 System.out.print("Load to Greenplum starts "+ 
  Calendar.getInstance().getTime() + "\r\n");

 CopyManager copyManager = new CopyManager((BaseConnection) GPConnection);
copyManager.copyIn("copy test_table from stdin csv",orinput);
System.out.print("Load to Greenplum ends "+ 
Calendar.getInstance().getTime() + "\r\n");

However during the conversion to ByteInputStream, the newline seems to be lost and I get the following error while loading in to Greenplum..

ERROR: COPY metadata not found. This probably means that there is a mixture of newline types in the data. Use the NEWLINE keyword in order to resolve this reliably. (seg40 sdw6.gphd.local:1025 pid=101588)

When I print the string 'data', it does seem to have newline and the values are getting printed correctly.. however it doesn't load while doing bulk load in to DB.

How to preserve the newline in the above scenario so that the load happens correctly? Or if there is a way to convert the printsream in to standard input, that works too. Thanks!

Upvotes: 0

Views: 399

Answers (1)

greengreyblue
greengreyblue

Reputation: 389

Try : "\r\n" instead of "\n"

ByteArrayOutputStream output = new ByteArrayOutputStream();
output.write("something\r\n"".getBytes());
output.write("something\r\n"".getBytes());

ByteArrayOutputStream input = new ByteArrayInputStream(output.getBytes());
s3.putStream(input);

Looks similar to this:

ByteArrayOutputStream/InputStream losing newline characters on S3 Import

Adding sample code which has been tried with PrintStream below:

static void printStream() throws Exception {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        PrintStream ps = new PrintStream(baos, true, "UTF-8");
        ps.println("test 1");
        ps.println("test 2");
        ps.println("test 3");
        System.out.print(new String(baos.toByteArray()));
    }

it is printing:

test 1
test 2
test 3

Upvotes: 1

Related Questions