Reputation: 83
I'm scraping table data from a PDF using Camelot-py, and it is not picking up stacked lines of text (see rows 9 and 10 below).
In rows 9 and 10, the account cell comes out empty.
https://camelot-py.readthedocs.io/en/master/user/advanced.html#specify-table-areas
Here is the code from my .ipynb notebook. The first block is for the first table, which extracts as expected; the second is for page 9.
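import camelot
import matplotlib.pyplot as plt

# Parse page 9; a larger line_scale lets Camelot detect shorter ruling lines
tables = camelot.read_pdf(r'C:\PDFFilePath', pages='9', line_scale=40)
tables[0].to_csv(r'Loans&Leases')
camelot.plot(tables[0], kind='contour')
plt.show()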
Using Matplotlib, I can see that Camelot is correctly detecting the table area/grid for page 9.
Here is a Google Drive link to the PDF
Any insight would be greatly appreciated.
Upvotes: 2
Views: 3510
Reputation: 3536
Your code is correct.
If you inspect tables[0].df, this is the output, which is correct:
So the problem arises at the moment of exporting to CSV: the cells in the 10th and 11th rows contain a line break (\n), which splits each of those records across two lines in the file.
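To make the effect concrete, here is a minimal sketch that uses only Python's standard csv module rather than Camelot itself. The cell values are hypothetical stand-ins for the stacked text in the PDF, and to_csv_text is a helper invented for this illustration. It shows that a cell containing \n is written as a quoted field spanning two physical lines, so a viewer that reads the file line by line may show the text as missing or misplaced, even though a CSV parser still recovers it.

```python
import csv
import io


def to_csv_text(rows):
    """Serialize rows to CSV text (hypothetical helper for this example)."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    return buf.getvalue()


# Hypothetical data: the second row mimics a cell with stacked text.
rows = [
    ["Account", "Balance"],
    ["Loans and\nLeases", "1,234"],
]

# Exported as-is, the embedded newline splits one record over two lines.
raw = to_csv_text(rows)

# Replacing the newline before export keeps each record on one line.
# (A space is used here for readability; this is a choice made for the
# example, not necessarily what a given library does when stripping.)
cleaned = to_csv_text(
    [[cell.replace("\n", " ") for cell in row] for row in rows]
)

print(raw)
print(cleaned)

# A CSV-aware reader still recovers the original cell from the raw text.
recovered = list(csv.reader(io.StringIO(raw)))
```

The data is therefore not lost in the raw export; it is only laid out in a way that some tools display poorly.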
A solution can be the following code:
tables = camelot.read_pdf(r'C:\PDFFilePath', pages='9', line_scale=40, strip_text='\n')
With strip_text, you can strip unwanted characters from each cell before it is stored (see the official documentation).
Now, if you export the table to CSV, you get:
Upvotes: 2