Reputation: 559
I have a PDF file containing a table; the format is like this:
Now I need to extract the data from specific columns of each row to insert into a database. How can I extract only the columns I want, using either JavaScript or Python?
I already tried doing it manually, but that is not sufficient.
I expect to get the raw data put in a variable (array or list).
========================================== UPDATE:
I decided to go with Python. The library is called tabula (the package is tabula-py); I installed it using pip:
pip install tabula-py
You pass the PDF to the library and specify the page that contains the table. The output for the table in my question looks magically like this:
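For reference, the call described above can be sketched roughly as follows. This is a minimal sketch, not the exact code I ran: the function names `extract_table_rows` and `select_columns` are made up for illustration, and it assumes tabula-py (which needs a Java runtime) is installed and that the table has a header row that tabula picks up as column names.

```python
def extract_table_rows(pdf_path, page):
    """Read the tables on one page with tabula-py and return them as row dicts."""
    import tabula  # imported lazily; requires tabula-py and a Java runtime

    # read_pdf returns a list of pandas DataFrames, one per detected table.
    tables = tabula.read_pdf(pdf_path, pages=page)
    rows = []
    for table in tables:
        # Each record is a dict mapping column header -> cell value.
        rows.extend(table.to_dict("records"))
    return rows


def select_columns(rows, wanted):
    """Keep only the wanted columns of each row, returned as a list of lists."""
    # Hypothetical helper: a missing column yields None instead of raising.
    return [[row.get(name) for name in wanted] for row in rows]
```

Something like `select_columns(extract_table_rows("table.pdf", 1), ["Name", "Amount"])` would then give a plain list of lists ready to insert into a database; the file name and the two header names are placeholders.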
Upvotes: 3
Views: 13812
Reputation: 79
I used pdfjs-dist to extract the items in a PDF and applied some rules to identify the table elements. The extracted items not only have the text, but also have an attribute called "transform" (a transformation matrix) that contains coordinate information, which can also be used to identify the table elements.
The first thing is to find the beginning of a table. In many cases the headers are the same, so you can use those words to find the beginning. The first element of each table row may share the same coordinate, which can also give a clue as to where a table starts. After the beginning of a table is identified, because all the tables are fixed width, the items can be divided into columns. Just pay attention that there may be more than one line of text in a single cell, so you'll need to combine them.
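The rules above can be sketched like this. It is written in Python rather than JavaScript and is only an illustration, not my original code. It assumes each item is a dict shaped like a pdf.js text item, with `str` for the text and `transform` as a six-number matrix whose last two entries are the x and y position. The function names, the `y_tolerance` value, the known `column_starts`, and the rule "a line with an empty first column continues the previous row" are all assumptions.

```python
import bisect


def group_into_rows(items, y_tolerance=2.0):
    """Group text items into visual lines by their y coordinate (transform[5])."""
    rows = []
    # PDF y grows upward, so sort descending to walk the page top to bottom.
    for item in sorted(items, key=lambda it: -it["transform"][5]):
        y = item["transform"][5]
        if rows and abs(rows[-1]["y"] - y) <= y_tolerance:
            rows[-1]["items"].append(item)
        else:
            rows.append({"y": y, "items": [item]})
    return [row["items"] for row in rows]


def to_cells(row_items, column_starts):
    """Put each item into the column whose left edge is closest on its left."""
    cells = [""] * len(column_starts)
    for item in sorted(row_items, key=lambda it: it["transform"][4]):
        x = item["transform"][4]
        col = max(bisect.bisect_right(column_starts, x) - 1, 0)
        cells[col] = (cells[col] + " " + item["str"]).strip()
    return cells


def extract_table(items, header_word, column_starts):
    """Return the rows that follow the line containing header_word."""
    rows = group_into_rows(items)
    start = next(
        (i for i, row in enumerate(rows)
         if any(it["str"] == header_word for it in row)),
        None,
    )
    if start is None:
        return []
    table = []
    for row_items in rows[start + 1:]:
        cells = to_cells(row_items, column_starts)
        if table and cells[0] == "":
            # Assumed rule: an empty first column means a wrapped cell,
            # so merge this line into the previous table row.
            table[-1] = [(a + " " + b).strip() for a, b in zip(table[-1], cells)]
        else:
            table.append(cells)
    return table
```

In practice the column edges would come from the x positions of the header items, and you would also need a rule for where the table ends, which this sketch leaves out.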
Upvotes: 5
Reputation: 648
You could try AWS Textract. It has a feature that extracts tables and gives you the data as CSV/JSON.
You can look up more about it here.
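A rough sketch of what that could look like with boto3, assuming AWS credentials and a region are already configured. The function names are made up for illustration. As far as I know the synchronous `analyze_document` call is meant for single-page documents, so a multi-page PDF would likely need the asynchronous API instead. The response is a flat list of blocks, so the second function rebuilds each table from the `TABLE`, `CELL`, and `WORD` blocks.

```python
def analyze_tables(document_bytes):
    """Call Textract's synchronous table analysis on one document."""
    import boto3  # imported lazily; needs AWS credentials and a region

    client = boto3.client("textract")
    return client.analyze_document(
        Document={"Bytes": document_bytes},
        FeatureTypes=["TABLES"],
    )


def tables_from_blocks(blocks):
    """Rebuild each table as a list of rows from Textract's flat block list."""
    by_id = {block["Id"]: block for block in blocks}

    def child_ids(block):
        return [
            child_id
            for rel in block.get("Relationships", [])
            if rel["Type"] == "CHILD"
            for child_id in rel["Ids"]
        ]

    tables = []
    for block in blocks:
        if block["BlockType"] != "TABLE":
            continue
        cells = {}
        for cell_id in child_ids(block):
            cell = by_id[cell_id]
            if cell["BlockType"] != "CELL":
                continue
            words = [
                by_id[word_id]["Text"]
                for word_id in child_ids(cell)
                if by_id[word_id]["BlockType"] == "WORD"
            ]
            # RowIndex and ColumnIndex are 1-based.
            cells[(cell["RowIndex"], cell["ColumnIndex"])] = " ".join(words)
        if not cells:
            continue
        n_rows = max(row for row, _ in cells)
        n_cols = max(col for _, col in cells)
        tables.append([
            [cells.get((row, col), "") for col in range(1, n_cols + 1)]
            for row in range(1, n_rows + 1)
        ])
    return tables
```

You would call `tables_from_blocks(analyze_tables(data)["Blocks"])` and then pick the columns you need from each row.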
Upvotes: 3