Mohammed Baashar

Reputation: 559

How to extract data from a table in a PDF file?

I have a PDF file containing a table, the format is like this:

pdf img

Now I need to extract the data from specific columns of each row to insert into a database. How can I extract only the columns I want, using either JavaScript or Python?

I already tried the manual way but that is not sufficient.

I expect to get the raw data put in a variable (array or list).

========================================== UPDATE:

I decided to go with Python. The library is called tabula, and I installed it using pip:

pip install tabula-py

You pass the pdf to the library and specify the page of the table. The output of the table in my question looks magically like this:

enter image description here
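
For anyone who wants to see the steps in code, here is a minimal sketch of that approach. The file name, page number, and column names in the usage note are placeholders, not values from my PDF, and `select_columns` and `extract_table_columns` are helper names made up for this example. It assumes a recent tabula-py where `read_pdf` returns a list of DataFrames (older versions may return a single DataFrame, which the sketch also handles). tabula-py wraps tabula-java, so Java must be installed.

```python
def select_columns(records, columns):
    """Keep only the wanted columns from a list of row dicts."""
    return [[rec.get(col) for col in columns] for rec in records]


def extract_table_columns(pdf_path, page, columns):
    """Read the tables on one page and return the chosen columns as a list of rows."""
    import tabula  # imported here so the helper above works without tabula installed

    tables = tabula.read_pdf(pdf_path, pages=page)
    if not isinstance(tables, list):
        # Older tabula-py versions may return a single DataFrame.
        tables = [tables]

    rows = []
    for df in tables:
        # Each DataFrame becomes a list of {column name: value} dicts.
        rows.extend(select_columns(df.to_dict("records"), columns))
    return rows
```

A call such as `extract_table_columns("report.pdf", 1, ["Name", "Amount"])` would then give a plain list of lists that is ready to insert into a database, provided those column names match the header that tabula detects.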

Upvotes: 3

Views: 13812

Answers (2)

Roy Keane

Reputation: 79

I used pdfjs-dist to extract the items in a PDF and applied some rules to identify the table elements. Each extracted item has not only the text but also an attribute called "transform" (a transformation matrix) that contains coordinate information, which can also be used to identify the table elements.

The first step is to find the beginning of a table. In many cases the headers are the same, so you can use those words to find where the table starts. The first element of each row may share the same coordinate, which also gives a clue about where a table starts. Once the beginning of a table is identified, and because the tables are fixed width, the items can be assigned to columns by position. Note that a single cell may contain more than one line of text, so you'll need to combine those lines.
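
The rules above can be sketched roughly as follows. This is written in Python rather than JavaScript to match the update in the question, and it only assumes that the items have already been dumped to dicts shaped like pdfjs-dist text items, with a `str` field and a six-number `transform` whose last two entries are the x and y position. The function names, the tolerances, and the rule "a line with an empty first column continues the previous row" are assumptions made for illustration, and the column assignment assumes left-aligned columns.

```python
def group_rows(items, y_tol=2.0):
    """Group text items into rows by their y coordinate (transform[5])."""
    rows = []
    # PDF y grows upward, so sort top to bottom, then left to right.
    for item in sorted(items, key=lambda it: (-it["transform"][5], it["transform"][4])):
        y = item["transform"][5]
        if rows and abs(rows[-1]["y"] - y) <= y_tol:
            rows[-1]["items"].append(item)
        else:
            rows.append({"y": y, "items": [item]})
    return rows


def find_header(rows, header_words):
    """Return the index of the first row containing all header words, or None."""
    for index, row in enumerate(rows):
        texts = {it["str"].strip() for it in row["items"]}
        if set(header_words) <= texts:
            return index
    return None


def table_from_items(items, header_words, y_tol=2.0, x_tol=5.0):
    """Rebuild table rows below the header, using header x positions as columns."""
    rows = group_rows(items, y_tol)
    start = find_header(rows, header_words)
    if start is None:
        return []

    header = sorted(rows[start]["items"], key=lambda it: it["transform"][4])
    col_x = [it["transform"][4] for it in header]  # left edge of each column

    table = []
    for row in rows[start + 1:]:
        cells = [""] * len(col_x)
        for it in sorted(row["items"], key=lambda it: it["transform"][4]):
            x = it["transform"][4]
            # Pick the right-most column whose left edge is at or before x.
            col = max((i for i, cx in enumerate(col_x) if cx <= x + x_tol), default=0)
            cells[col] = (cells[col] + " " + it["str"]).strip()
        if cells[0] == "" and table:
            # Assumed rule: an empty first column means a wrapped cell line,
            # so merge it into the previous row.
            table[-1] = [(p + " " + c).strip() for p, c in zip(table[-1], cells)]
        else:
            table.append(cells)
    return table
```

Real PDFs usually need the tolerances tuned, and a table whose first column can legitimately be empty would need a different rule for detecting wrapped lines.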

Upvotes: 5

koushikmln

Reputation: 648

You could try AWS Textract. It has a feature that extracts tables and gives you the data as CSV/JSON.

You can look up more about it here
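
As a rough sketch of what this looks like from Python, the snippet below calls Textract through boto3 with the `TABLES` feature and turns the returned blocks into rows. The helper names and the file name in the usage note are made up for illustration, and it assumes AWS credentials and a region are already configured. As far as I know, the synchronous `analyze_document` call is limited to single-page documents, so treat multi-page PDFs as needing the asynchronous API instead.

```python
def child_ids(block):
    """Collect the Ids from a block's CHILD relationships."""
    return [
        block_id
        for rel in block.get("Relationships", [])
        if rel["Type"] == "CHILD"
        for block_id in rel["Ids"]
    ]


def tables_from_blocks(blocks):
    """Convert Textract blocks into a list of tables, each a list of rows."""
    by_id = {b["Id"]: b for b in blocks}
    tables = []
    for table in (b for b in blocks if b["BlockType"] == "TABLE"):
        cells = {}
        for cell_id in child_ids(table):
            cell = by_id[cell_id]
            if cell["BlockType"] != "CELL":
                continue
            words = [
                by_id[i]["Text"]
                for i in child_ids(cell)
                if by_id[i]["BlockType"] == "WORD"
            ]
            # RowIndex and ColumnIndex are 1-based.
            cells[(cell["RowIndex"], cell["ColumnIndex"])] = " ".join(words)
        if not cells:
            continue
        n_rows = max(r for r, _ in cells)
        n_cols = max(c for _, c in cells)
        tables.append(
            [[cells.get((r, c), "") for c in range(1, n_cols + 1)]
             for r in range(1, n_rows + 1)]
        )
    return tables


def extract_tables(path):
    """Send one document to Textract and return its tables."""
    import boto3  # imported here so the parsing helpers work without boto3

    with open(path, "rb") as handle:
        data = handle.read()
    client = boto3.client("textract")
    response = client.analyze_document(
        Document={"Bytes": data}, FeatureTypes=["TABLES"]
    )
    return tables_from_blocks(response["Blocks"])
```

Calling `extract_tables("page.pdf")` would return one list of rows per detected table, from which you can pick the columns you need by index.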

Upvotes: 3
