jessy

Reputation: 75

How to get table coordinates using python-camelot?

I am trying to parse some PDF files to extract key information. Each PDF contains a number of tables that hold part of this information. I used camelot to extract the tables and got good results, but I also want to extract the title of each table so that I can map each table to its title. I tried getting the coordinates of each table with tables[i]._bbox and then adding a margin to those coordinates to find the area containing the title (it can be above, to the left of, or below the table), as shown in the images:

[Image: title of the table on the left]

[Image: title of the table on the top]

Can anyone tell me how to get the coordinates of the red area containing the table title from the PDF, based on the table coordinates, using Python?

Upvotes: 4

Views: 6791

Answers (1)

Thomas

Reputation: 305

You can create the PDF parser directly. For example, for the Lattice flavor:

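from camelot.parsers import Lattice

# kwargs, pages, suppress_stdout and layout_kwargs are assumed to be defined already.
parser = Lattice(**kwargs)
tables = []
for p in pages:
    t = parser.extract_tables(p, suppress_stdout=suppress_stdout,
                              layout_kwargs=layout_kwargs)
    tables.extend(t)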

You then have access to parser.layout, which contains all the components on the page. Each component has a bbox (x0, y0, x1, y1), and each extracted table has a bounding box as well (tables[i]._bbox in the question). You can find the component closest to the table and extract its text and coordinates. If you don't want to change the way you invoke table extraction in camelot, you can parse the PDF again:

from camelot import utils
layout, dim = utils.get_page_layout(file_name)
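
To illustrate the "find the closest component" step, here is a minimal sketch; it is not part of camelot's API. The helper names bbox_gap and closest_text and the max_gap threshold are hypothetical. It assumes each component exposes a bbox tuple (x0, y0, x1, y1) and, for text components, a get_text() method, as pdfminer layout objects do. It also assumes that anything touching or overlapping the table's box belongs to the table rather than to its title.

```python
from typing import Iterable, Optional, Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def _normalize(b: BBox) -> BBox:
    """Order the corners so that x0 <= x1 and y0 <= y1."""
    return (min(b[0], b[2]), min(b[1], b[3]), max(b[0], b[2]), max(b[1], b[3]))


def bbox_gap(a: BBox, b: BBox) -> float:
    """Shortest distance between two boxes (0.0 if they touch or overlap)."""
    a, b = _normalize(a), _normalize(b)
    dx = max(b[0] - a[2], a[0] - b[2], 0.0)  # horizontal separation
    dy = max(b[1] - a[3], a[1] - b[3], 0.0)  # vertical separation
    return (dx * dx + dy * dy) ** 0.5


def closest_text(table_bbox: BBox, components: Iterable,
                 max_gap: float = 50.0) -> Optional[Tuple[str, BBox]]:
    """Return (text, bbox) of the nearest text component closer than max_gap.

    max_gap is an arbitrary illustrative threshold in PDF units.
    Returns None when no candidate is found.
    """
    best = None
    best_gap = max_gap
    for comp in components:
        get_text = getattr(comp, "get_text", None)
        if get_text is None:
            continue  # not a text object (line, rectangle, image, ...)
        text = get_text().strip()
        gap = bbox_gap(table_bbox, comp.bbox)
        if not text or gap == 0.0:
            continue  # empty text, or text assumed to be part of the table
        if gap < best_gap:
            best, best_gap = (text, comp.bbox), gap
    return best
```

With the layout from either approach above, a call might look like closest_text(tables[0]._bbox, layout). Because the gap is measured in every direction, a title above, below, or to the left of the table is handled the same way; max_gap will likely need tuning per document.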

Upvotes: 4
