Arpit Solanki

Reputation: 9931

No tables found and merged column text when extracting data from this PDF using Camelot

I get a UserWarning: No tables found on page-1 when I try to extract tables from the attached PDF. When I looked at the extracted data, some of the column text had also been merged into a single column.


I am using Camelot to parse these PDFs.

Steps to reproduce: camelot --output m27.csv --format csv stream m27.pdf

Here is a link to the PDF that I am trying to parse: https://github.com/tabulapdf/tabula-java/blob/master/src/test/resources/technology/tabula/m27.pdf

Upvotes: 0

Views: 4693

Answers (1)

Vinayak Mehta

Reputation: 369

A PDF just contains instructions to place a character at an x,y coordinate on a 2-D plane, retaining no knowledge of words, sentences or tables.

Camelot uses PDFMiner under the hood to group characters into words and words into sentences. Sometimes when the characters are too close, PDFMiner can group characters belonging to different words into a single one.
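
To make the merging effect concrete, here is a toy sketch of gap-based grouping. It is not PDFMiner's actual algorithm; the function name group_chars, the max_gap threshold and the coordinates are all made up for illustration. It only shows how characters from two neighbouring cells can end up in one word when the gap between them is below the grouping threshold.

```python
# Toy illustration only: this is NOT PDFMiner's actual algorithm.
# Each character is (text, x0, x1): the glyph and its left/right edges.

def group_chars(chars, max_gap):
    """Group characters into words, starting a new word whenever the
    horizontal gap to the previous character exceeds max_gap
    (a hypothetical threshold chosen for this example)."""
    words = []
    current = ""
    prev_x1 = None
    for text, x0, x1 in sorted(chars, key=lambda c: c[1]):
        if prev_x1 is not None and x0 - prev_x1 > max_gap:
            words.append(current)
            current = ""
        current += text
        prev_x1 = x1
    if current:
        words.append(current)
    return words

# Two cells, "AB" and "CD", whose glyphs sit only 1 unit apart.
chars = [("A", 0, 5), ("B", 5, 10), ("C", 11, 16), ("D", 16, 21)]

merged = group_chars(chars, max_gap=2)      # the gap of 1 is below the threshold
separate = group_chars(chars, max_gap=0.5)  # a tighter threshold keeps them apart
print(merged, separate)
```

With the looser threshold the two cells come back as one word, which is the situation in the table from the question.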

Since the characters in your PDF table are placed very close together, they are being merged into a single word, so Camelot isn't able to detect the columns correctly. You can specify column separators to get the table out in this case. To get the x-coordinates of the column separators, you can check out the visual debugging guide. Additionally, you can specify split_text=True to cut a word along the column separators you've specified. Here's the code (I got the x-coordinates by creating a matplotlib plot of the text in the PDF using $ camelot stream -plot text m27.pdf):

Using CLI:

$ camelot --output m27.csv --format csv -split stream -C 72,95,209,327,442,529,566,606,683 m27.pdf

Using API:

>>> import camelot
>>> tables = camelot.read_pdf('m27.pdf', flavor='stream', columns=['72,95,209,327,442,529,566,606,683'], split_text=True)
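
For intuition about what cutting a word along column separators means, here is a rough sketch. It is not Camelot's implementation: the helper split_along_columns, the midpoint rule and the sample coordinates are assumptions made up for this example. The only thing taken from the call above is the comma-separated string format for the separators.

```python
from bisect import bisect_right

# Rough illustration only: this is NOT how Camelot implements split_text.
# Each character is (text, x0, x1): the glyph and its left/right edges.

def split_along_columns(chars, separators):
    """Put each character into the column that contains its horizontal
    midpoint and return one string per column (len(separators) + 1 cells)."""
    seps = sorted(separators)
    cells = [""] * (len(seps) + 1)
    for text, x0, x1 in sorted(chars, key=lambda c: c[1]):
        mid = (x0 + x1) / 2
        cells[bisect_right(seps, mid)] += text
    return cells

# Separators given as a comma-separated string, like the columns argument.
separators = [float(x) for x in "10.5".split(",")]

# One merged word "ABCD" that really spans two columns.
chars = [("A", 0, 5), ("B", 5, 10), ("C", 11, 16), ("D", 16, 21)]

cells = split_along_columns(chars, separators)
print(cells)
```

A single separator at x = 10.5 is enough to split the merged word back into two cells.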

Upvotes: 3
