Dhar_

Reputation: 71

How to extract data from tables in a PDF using Python?

I need to extract data from tables in multiple PDFs using Python. I have tested both camelot and tabula, but neither is able to get the data accurately. The tables have some merged cells, cells with multiple lines of information, etc., so both libraries get confused. Is there a good way of approaching this issue?

Upvotes: 0

Views: 1336

Answers (1)

janreggie

Reputation: 434

If both libraries fail, there may be something wrong with the underlying structure of the table as encoded in the PDF.

You could use OCR and then do some string/regex manipulation to extract the column data from each row. github.com/cseas/ocr-table seems to work; look at its input.pdf and output.txt to see whether it fits your situation.
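To illustrate the string/regex step, here is a minimal sketch that works on text an OCR tool has already produced. It is not taken from ocr-table: the sample text, the function names, and the rule that columns are separated by two or more spaces are all assumptions made for the example, and real OCR output will likely need a different splitting rule.

```python
import re

# Hypothetical OCR output. In practice this text would come from whatever
# OCR tool you run on the PDF pages.
SAMPLE_OCR_TEXT = """\
Item        Qty   Price
Widget A    2     10.50
Widget B    11    3.25
"""

# Assumption: columns are separated by runs of two or more whitespace
# characters, so a single space inside a cell ("Widget A") is preserved.
COLUMN_GAP = re.compile(r"\s{2,}")


def parse_rows(text):
    """Split each non-empty line of OCR text into a list of cell strings."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        rows.append(COLUMN_GAP.split(line))
    return rows


def rows_to_records(rows):
    """Use the first row as the header and map each later row onto it.

    Rows whose cell count differs from the header are skipped here; those
    are typically the merged or multi-line cells that need manual handling.
    """
    header, *body = rows
    return [dict(zip(header, row)) for row in body if len(row) == len(header)]


records = rows_to_records(parse_rows(SAMPLE_OCR_TEXT))
for record in records:
    print(record)
```

Rows that do not match the header width are dropped rather than guessed at, so it is worth logging them separately and deciding case by case how to merge a multi-line cell back into its row.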

Upvotes: 1
