Convert PDF to CSV using Java

Question

I have tried most of the suggestions on Stack Overflow and elsewhere, without success.

Problem: I have a PDF that contains both free text and tables. I need to parse the tables as well as the surrounding text.

API: I am using tabula-java (https://github.com/tabulapdf/tabula-java), but it ignores some of the content, and the contents of table cells are not separated properly.

My PDF has content like this:

 DATE :1/1/2018         ABCD                   SCODE:FFFT
                       --ACCEPTED--
    USER:ADMIN         BATCH:RR               EEE
    CON BATCH
    =======================================================================
    MAIN SNO SUB  VALUE DIS %
    R    12   rr1 0125  24.5
            SLNO  DESC  QTY  TOTAL  CODE   FREE
            1     ABD   12   90     BBNEW  -NILL-
            2     XDF   45   55     GHT55  MRP
            3     QWE   08   77     CAT    -NILL-
    =======================================================================
    MAIN SNO SUB  VALUE DIS %
    QW    14   rr2 0122  24.5
            SLNO  DESC  QTY  TOTAL  CODE   FREE
            1     ABD   12   90     BBNEW  -NILL-
            2     XDF   45   55     GHT55  MRP
            3     QWE   08   77     CAT    -NILL-

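Tabula code for the conversion:

    import java.io.File;
    import org.apache.commons.cli.CommandLine;
    import org.apache.commons.cli.CommandLineParser;
    import org.apache.commons.cli.DefaultParser;
    import org.apache.commons.cli.ParseException;

    public static void toCsv() throws ParseException {
        // Same options as on the command line: page 1, output to "$csv".
        String[] commandLineOptions = { "-p", "1", "-o", "$csv" };
        CommandLineParser parser = new DefaultParser();
        try {
            CommandLine line = parser.parse(TabulaUtil.buildOptions(), commandLineOptions);
            // Extract the tables from the input PDF into the output CSV file.
            new TabulaUtil(System.out, line).extractFileInto(
                    new File("/home/sample/firstPage.pdf"),
                    new File("/home/sample/onePage.csv"));
        } catch (Exception e) {
            e.printStackTrace();
        }
    }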

Tabula also supports a command-line interface:

java -jar TabulaJar/tabula-1.0.2-jar-with-dependencies.jar -p all  -o  $csv -b Pdfs

I have tried Tabula's -c,--columns option, which splits cells at the given X coordinates of the column boundaries.

The problem is that the content of my PDFs is dynamic, i.e. the table sizes vary, so fixed column coordinates do not work.

These Stack Overflow questions, and many more, did not work for me:

How to convert PDF to CSV with tabula-py?

How to extract table data from PDF as CSV from the command line?

Convert PDF to Excel in Java

How to convert a pdf file into CSV file?

itext Converting PDF to csv

Parse PDF table and display it as CSV(Java)

I have also used PDFBox, but it returns unformatted text from which I cannot read the table contents properly.

Is it possible to convert a PDF with tables to CSV/Excel using Java without losing content and formatting?

I don't want to use paid libraries.
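To make the goal concrete, here is a minimal sketch of the post-processing I have in mind. It assumes the text can be extracted with each table row on its own line, as in the sample above; that is an assumption and depends on the PDF and on the extractor settings. The class `LayoutTextToCsv` and its methods are hypothetical helpers written for illustration, not part of tabula-java or PDFBox. The sketch splits each line on runs of whitespace, so it needs no fixed column coordinates.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of any PDF library.
class LayoutTextToCsv {

    // Converts one line of layout-preserved text into one CSV record by
    // splitting on runs of whitespace. Returns null for blank lines and
    // for separator lines that consist only of '=' characters.
    static String toCsvLine(String line) {
        String trimmed = line.trim();
        if (trimmed.isEmpty() || trimmed.matches("=+")) {
            return null;
        }
        List<String> cells = new ArrayList<>();
        for (String cell : trimmed.split("\\s+")) {
            cells.add(quote(cell));
        }
        return String.join(",", cells);
    }

    // Quotes a cell only if it contains a comma, a double quote, or a
    // line break; embedded double quotes are doubled.
    static String quote(String cell) {
        if (cell.contains(",") || cell.contains("\"") || cell.contains("\n")) {
            return "\"" + cell.replace("\"", "\"\"") + "\"";
        }
        return cell;
    }

    // Converts all lines, dropping the ones toCsvLine rejects.
    static List<String> toCsv(List<String> lines) {
        List<String> records = new ArrayList<>();
        for (String line : lines) {
            String record = toCsvLine(line);
            if (record != null) {
                records.add(record);
            }
        }
        return records;
    }
}
```

For example, `LayoutTextToCsv.toCsvLine("1     ABD   12   90     BBNEW  -NILL-")` yields `1,ABD,12,90,BBNEW,-NILL-`. The limitation is that whitespace splitting cannot tell a space inside a cell from a gap between columns, so a header such as `DIS %` becomes two cells, and empty cells shift the remaining values to the left.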
