Reputation: 75
I am trying to parse some PDF files to extract key information. Each PDF contains a number of tables that hold part of this information. I used camelot to extract the tables and got good results, but I also want to extract the title of each table so that I can map each table to its title. I tried getting the coordinates of each table using tables[i]._bbox and then adding some margin to these coordinates to detect the area containing the table title (it can be above, to the left of, or below the table), as shown in the image: title of table on the left.
Can anyone tell me how to get the coordinates of the red area containing the table title in the PDF, based on the table coordinates, using Python?
Upvotes: 4
Views: 6791
Reputation: 305
You can create the PDF parser directly. For example, for Lattice:
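from camelot.parsers import Lattice

# kwargs, pages, suppress_stdout and layout_kwargs are assumed to be defined by the caller
tables = []
parser = Lattice(**kwargs)
for p in pages:
    t = parser.extract_tables(p, suppress_stdout=suppress_stdout,
                              layout_kwargs=layout_kwargs)
    tables.extend(t)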
Then you have access to parser.layout, which contains all the components in the page. These components all have a bbox (x0, y0, x1, y1), and the extracted tables also have a bbox object. You can find the closest component to the table and extract its text and coordinates.
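The "closest component" step could be sketched roughly as follows. This is only an illustration working on plain (x0, y0, x1, y1) tuples; the helper names bbox_gap and closest_component and the sample coordinates are hypothetical and not part of camelot. It assumes that text lying inside the table (gap of zero) should be skipped, so that cell contents are not mistaken for the title.

```python
from math import hypot


def bbox_gap(a, b):
    """Shortest distance between two (x0, y0, x1, y1) boxes; 0 if they touch or overlap."""
    dx = max(b[0] - a[2], a[0] - b[2], 0)
    dy = max(b[1] - a[3], a[1] - b[3], 0)
    return hypot(dx, dy)


def closest_component(table_bbox, components):
    """Return the (bbox, text) pair nearest to the table, or None.

    components is an iterable of (bbox, text) pairs. Components that
    overlap the table (gap == 0) or have blank text are ignored.
    """
    candidates = [
        (bbox, text)
        for bbox, text in components
        if text.strip() and bbox_gap(table_bbox, bbox) > 0
    ]
    return min(candidates, key=lambda c: bbox_gap(table_bbox, c[0]), default=None)


# Hypothetical coordinates, for illustration only
table_bbox = (100, 200, 400, 500)
components = [
    ((100, 510, 300, 525), "Table 1: Revenue"),  # just above the table
    ((120, 300, 180, 315), "cell text"),         # inside the table, skipped
    ((100, 700, 300, 720), "Page header"),       # far away
]
title_bbox, title_text = closest_component(table_bbox, components)
```

Because the distance is measured in every direction, the same function picks up a title above, below, or to the left of the table. The returned bbox gives the coordinates of the title area.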
If you don't want to change the way you invoke table extraction in camelot, you can parse the PDF again:
from camelot import utils
layout, dim = utils.get_page_layout(file_name)
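To turn such a layout into something you can search, one option is to collect the text-bearing items together with their bounding boxes. The sketch below is an assumption about the layout object rather than documented camelot behaviour: it presumes that iterating over layout yields items with a bbox attribute, and that text items additionally expose get_text(), as pdfminer text boxes do. The helper name text_components is hypothetical.

```python
def text_components(layout):
    """Yield (bbox, text) for each top-level layout item that carries text.

    Assumes text items expose get_text() and bbox; items without
    get_text() (lines, rectangles, figures) are skipped.
    """
    for item in layout:
        get_text = getattr(item, "get_text", None)
        if callable(get_text):
            text = get_text().strip()
            if text:
                yield item.bbox, text
```

Each yielded pair can then be compared against a table's bounding box to find the nearest piece of text.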
Upvotes: 4