Logan McNulty

Reputation: 83

Camelot-py not detecting two lines of text in one row

I am scraping table data from a PDF with Camelot-py, and it is not picking up cells that contain two stacked lines of text (see rows 9 and 10 in the image below).

[Image: rows 1 through 14]

In the output, the account text for rows 9 and 10 is empty.

https://camelot-py.readthedocs.io/en/master/user/advanced.html#specify-table-areas

Here is the code I have in .ipynb format. The first block is for the first table that pulls as expected, the second is for page 9.

[Image: Table]

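import camelot
import matplotlib.pyplot as plt

# Read the table on page 9, export it, and plot the detected contour.
tables = camelot.read_pdf(r'C:\PDFFilePath', pages='9', line_scale=40)
tables[0].to_csv(r'Loans&Leases')
camelot.plot(tables[0], kind='contour')
plt.show()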

Using Matplotlib, I can see that Camelot is correctly detecting the table area and grid for page 9.

[Image: plot table area]

[Image: plot grid]

Here is a Google Drive link to the PDF: Call Report PDF

Any insight would be greatly appreciated.

Upvotes: 2

Views: 3510

Answers (1)

Your code is correct.

If you inspect tables[0].df, the output is correct:

[Image: output of tables[0].df]

So the problem only appears when you export to CSV: the cells in the 10th and 11th rows contain a line break (\n).
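To see why a line break inside a cell is awkward in a CSV file, here is a minimal sketch that uses only Python's standard csv module. The row values are made up for illustration and do not come from the PDF. The quoted cell occupies two physical lines in the file, yet a CSV-aware reader still returns it as one field.

```python
import csv
import io

# Made-up row standing in for a table row whose middle cell has stacked text.
row = ["9", "first line\nsecond line", "100"]

# Write the row to an in-memory CSV file.
buffer = io.StringIO()
csv.writer(buffer).writerow(row)
raw = buffer.getvalue()

# The quoted line break makes the single record span two physical lines.
physical_lines = raw.splitlines()

# A CSV-aware reader still recovers one record with three fields.
parsed = list(csv.reader(io.StringIO(raw, newline="")))
```

Tools that split the file on line breaks without honoring the quotes will treat that record as two broken rows, which may be why the text seems to be missing.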

A solution can be the following code:

tables = camelot.read_pdf(r'C:\PDFFilePath', pages='9', line_scale=40, strip_text='\n')

The strip_text argument removes the listed characters from each cell's text before it is stored in the table (see the official documentation).

Now, if you export the table to CSV, you get:

[Image: exported CSV]
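As far as I understand, strip_text deletes the characters outright, so the two stacked lines end up joined with no space between them. If you would rather keep a space, one alternative is to clean the DataFrame with pandas before exporting. The sketch below uses a stand-in DataFrame with made-up values in place of tables[0].df, and flatten_newlines is a hypothetical helper name, not part of Camelot.

```python
import pandas as pd

# Stand-in for tables[0].df; the cell values are made up for illustration.
df = pd.DataFrame([
    ["9", "Loans to\nindividuals", "100"],
    ["10", "Other\nloans", "200"],
])


def flatten_newlines(frame):
    """Return a copy with in-cell line breaks collapsed to a single space."""
    return frame.replace(r"\s*\n\s*", " ", regex=True)


flat = flatten_newlines(df)

# With no path given, to_csv returns the CSV text instead of writing a file.
csv_text = flat.to_csv(index=False, header=False)
```

With a real table you would call flatten_newlines(tables[0].df).to_csv(...) and leave strip_text out of read_pdf.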

Upvotes: 2
