Reputation: 9931
I get a UserWarning: No tables found on page-1
when I try to extract tables from the attached PDF . However, when I looked at the extracted data, some of the column text was merged into a single column.”
I am using Camelot to parse these PDFs
Steps to reproduce: camelot --output m27.csv --format csv stream m27.pdf
Here is a link to PDF that I am trying to parse: https://github.com/tabulapdf/tabula-java/blob/master/src/test/resources/technology/tabula/m27.pdf
Upvotes: 0
Views: 4693
Reputation: 369
A PDF just contains instructions to place a character at an x,y coordinate on a 2-D plane, retaining no knowledge of words, sentences or tables.
Camelot uses PDFMiner under the hood to group characters into words and words into sentences. Sometimes when the characters are too close, PDFMiner can group characters belonging to different words into a single one.
Since the characters in your PDF table are placed very close, they are being merged into a single word and hence Camelot isn't able to detect the columns correctly. You can specify column separators to get the table out in this case. To get the x-coordinates of column separators you can check out the visual debugging guide. Additionally, you can specify split_text=True
to cut the word along the column separators you've specified. Here's the code (I got the x-coordinates by creating a matplotlib plot of the text in the PDF using $ camelot stream -plot text m27.pdf
):
Using CLI:
$ camelot --output m27.csv --format csv -split stream -C 72,95,209,327,442,529,566,606,683 m27.pdf
Using API:
>>> import camelot
>>> tables = camelot.read_pdf('m27.pdf', flavor='stream', columns=['72,95,209,327,442,529,566,606,683'], split_text=True)
Upvotes: 3