Reputation: 1377
I have tried most of the suggestions on Stack Overflow and elsewhere.
Problem: I have a PDF that contains both text content and tables. I need to parse the tables as well as the content.
APIs:
https://github.com/tabulapdf/tabula-java
I am using tabula-java,
which ignores some of the content, and the contents of table cells are not separated properly.
My PDF has content like this:
DATE :1/1/2018 ABCD SCODE:FFFT
--ACCEPTED--
USER:ADMIN BATCH:RR EEE
CON BATCH
=======================================================================
MAIN SNO SUB VALUE DIS %
R 12 rr1 0125 24.5
SLNO DESC QTY TOTAL CODE FREE
1 ABD 12 90 BBNEW -NILL-
2 XDF 45 55 GHT55 MRP
3 QWE 08 77 CAT -NILL-
=======================================================================
MAIN SNO SUB VALUE DIS %
QW 14 rr2 0122 24.5
SLNO DESC QTY TOTAL CODE FREE
1 ABD 12 90 BBNEW -NILL-
2 XDF 45 55 GHT55 MRP
3 QWE 08 77 CAT -NILL-
Tabula code to convert :
public static void toCsv() throws ParseException {
    String commandLineOptions[] = { "-p", "1", "-o", "$csv" };
    CommandLineParser parser = new DefaultParser();
    try {
        CommandLine line = parser.parse(TabulaUtil.buildOptions(), commandLineOptions);
        new TabulaUtil(System.out, line).extractFileInto(
                new File("/home/sample/firstPage.pdf"),
                new File("/home/sample/onePage.csv"));
    } catch (Exception e) {
        e.printStackTrace();
    }
}
tabula also has a command-line interface:
java -jar TabulaJar/tabula-1.0.2-jar-with-dependencies.jar -p all -o $csv -b Pdfs
I have tried tabula's -c,--columns <COLUMNS> option,
which splits cells at the given X coordinates of the column boundaries.
But the problem is that my PDF content is dynamic, i.e. the table sizes change.
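To illustrate what "dynamic" column boundaries would need, here is a minimal sketch that infers the boundaries from the text itself instead of passing fixed X coordinates. It assumes a layout-preserving text dump of the page is available, with columns aligned by spaces as in a monospaced report; that is an assumption, not something tabula or PDFBox is confirmed to produce here. The class name ColumnGapFinder and its methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ColumnGapFinder {

    /**
     * Returns the start index of every column: a character position that is
     * non-blank in at least one line and whose left neighbour is blank in all lines.
     */
    static List<Integer> columnStarts(List<String> lines) {
        int width = 0;
        for (String line : lines) {
            width = Math.max(width, line.length());
        }
        // used[i] is true if any line has a non-space character at position i
        boolean[] used = new boolean[width];
        for (String line : lines) {
            for (int i = 0; i < line.length(); i++) {
                if (line.charAt(i) != ' ') {
                    used[i] = true;
                }
            }
        }
        List<Integer> starts = new ArrayList<>();
        for (int i = 0; i < width; i++) {
            if (used[i] && (i == 0 || !used[i - 1])) {
                starts.add(i);
            }
        }
        return starts;
    }

    /** Cuts one line at the inferred column starts and trims each cell. */
    static List<String> splitAt(String line, List<Integer> starts) {
        List<String> cells = new ArrayList<>();
        for (int c = 0; c < starts.size(); c++) {
            int from = Math.min(starts.get(c), line.length());
            int to = (c + 1 < starts.size())
                    ? Math.min(starts.get(c + 1), line.length())
                    : line.length();
            cells.add(line.substring(from, to).trim());
        }
        return cells;
    }

    public static void main(String[] args) {
        // Hypothetical aligned rows modelled on the sample data in the question
        List<String> block = Arrays.asList(
                "SLNO DESC QTY TOTAL CODE  FREE",
                "1    ABD  12  90    BBNEW -NILL-",
                "2    XDF  45  55    GHT55 MRP");
        List<Integer> starts = columnStarts(block);
        for (String line : block) {
            System.out.println(String.join(",", splitAt(line, starts)));
        }
    }
}
```

The boundaries are recomputed for each block of rows, so a table whose widths differ from the previous one gets its own set of column starts. A cell that contains a space in every row at the same position would be split in two, so this only works when the columns are cleanly separated.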
These Stack Overflow questions, and many more, did not work for me:
How to convert PDF to CSV with tabula-py?
How to extract table data from PDF as CSV from the command line?
How to convert a pdf file into CSV file?
Parse PDF table and display it as CSV(Java)
I have also used PDFBox, which gives unformatted text from which I can't read the table content properly.
Is it possible to convert a PDF with tables to CSV/Excel using Java without losing content and formatting?
I don't want to use paid libraries.
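For reference, the kind of post-processing I have in mind for the plain extracted text looks roughly like the sketch below. It assumes the text arrives one table row per line, that cells never contain spaces, and that a line made only of "=" characters separates blocks, as in the sample above. The class name PlainTextToCsv and its methods are hypothetical, and it uses only the JDK.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class PlainTextToCsv {

    /** Quotes a cell if it contains a comma, a double quote, or a line break. */
    static String escape(String cell) {
        if (cell.contains(",") || cell.contains("\"") || cell.contains("\n")) {
            return "\"" + cell.replace("\"", "\"\"") + "\"";
        }
        return cell;
    }

    /** Turns whitespace-separated text lines into CSV lines. */
    static List<String> toCsv(List<String> lines) {
        List<String> out = new ArrayList<>();
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty()) {
                continue; // skip blank lines
            }
            if (line.matches("=+")) {
                out.add(""); // block separator becomes an empty CSV row
                continue;
            }
            List<String> cells = new ArrayList<>();
            for (String cell : line.split("\\s+")) {
                cells.add(escape(cell));
            }
            out.add(String.join(",", cells));
        }
        return out;
    }

    public static void main(String[] args) {
        // Rows copied from the sample content in the question
        List<String> text = Arrays.asList(
                "=======================================================================",
                "SLNO DESC QTY TOTAL CODE FREE",
                "1 ABD 12 90 BBNEW -NILL-",
                "2 XDF 45 55 GHT55 MRP");
        for (String csvLine : toCsv(text)) {
            System.out.println(csvLine);
        }
    }
}
```

This breaks as soon as a cell contains a space: the header "DIS %" in my sample becomes two cells while its value row has only one, which is exactly the kind of misalignment I am trying to avoid.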
Upvotes: 6
Views: 9867
Reputation: 105
See an example of extracting PDF to CSV with Java here: https://github.com/pdftables/java-pdftables-api. Each page is considered independently, so the dynamic nature of your PDFs should not be an issue. You can use the free trial on their site.
package com.pdftables.examples;

import java.io.File;
import java.util.Arrays;
import java.util.List;

import org.apache.commons.io.FileUtils;
import org.apache.http.HttpEntity;
import org.apache.http.client.config.CookieSpecs;
import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.entity.mime.MultipartEntityBuilder;
import org.apache.http.entity.mime.content.FileBody;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;

public class ConvertToFile {
    private static List<String> formats = Arrays.asList(new String[] { "csv", "xml", "xlsx-single", "xlsx-multiple" });

    public static void main(String[] args) throws Exception {
        if (args.length != 3) {
            System.out.println("Command line: <API_KEY> <FORMAT> <PDF filename>");
            System.exit(1);
        }
        final String apiKey = args[0];
        final String format = args[1].toLowerCase();
        final String pdfFilename = args[2];
        if (!formats.contains(format)) {
            System.out.println("Invalid output format: \"" + format + "\"");
            System.exit(1);
        }
        // Avoid cookie warning with default cookie configuration
        RequestConfig globalConfig = RequestConfig.custom().setCookieSpec(CookieSpecs.STANDARD).build();
        File inputFile = new File(pdfFilename);
        if (!inputFile.canRead()) {
            System.out.println("Can't read input PDF file: \"" + pdfFilename + "\"");
            System.exit(1);
        }
        try (CloseableHttpClient httpclient = HttpClients.custom().setDefaultRequestConfig(globalConfig).build()) {
            HttpPost httppost = new HttpPost("https://pdftables.com/api?format=" + format + "&key=" + apiKey);
            FileBody fileBody = new FileBody(inputFile);
            HttpEntity requestBody = MultipartEntityBuilder.create().addPart("f", fileBody).build();
            httppost.setEntity(requestBody);
            System.out.println("Sending request");
            try (CloseableHttpResponse response = httpclient.execute(httppost)) {
                if (response.getStatusLine().getStatusCode() != 200) {
                    System.out.println(response.getStatusLine());
                    System.exit(1);
                }
                HttpEntity resEntity = response.getEntity();
                if (resEntity != null) {
                    final String outputFilename = getOutputFilename(pdfFilename, format.replaceFirst("-.*$", ""));
                    System.out.println("Writing output to " + outputFilename);
                    final File outputFile = new File(outputFilename);
                    FileUtils.copyToFile(resEntity.getContent(), outputFile);
                } else {
                    System.out.println("Error: file missing from response");
                    System.exit(1);
                }
            }
        }
    }

    private static String getOutputFilename(String pdfFilename, String suffix) {
        if (pdfFilename.length() >= 5 && pdfFilename.toLowerCase().endsWith(".pdf")) {
            return pdfFilename.substring(0, pdfFilename.length() - 4) + "." + suffix;
        } else {
            return pdfFilename + "." + suffix;
        }
    }
}
Upvotes: 0