Reputation: 352
I have a PDF (attached) in which a table row is split across multiple pages, with a page break in between. I am trying to extract the tabular data to a CSV file using pdfplumber, but the split row ends up as separate rows in the CSV. I would like to get this data in a single row.
With pdfplumber, is there a way to identify if the row has a horizontal border or not? If this information is available, it could help in merging the rows.
In the attached image, the cell contents are highlighted in grey.
Upvotes: 0
Views: 1145
Reputation: 1748
pdfplumber objects have a `top` attribute (for a character, the distance from the top of the page to the top of the character). You can use it to tell whether one page ends without a closing border and the next page starts without an opening border.
If the `top` value of the bottommost character is greater than the `top` value of the bottommost horizontal line, the page ends without a closing table border. Similarly, if the `top` of the topmost character is smaller than the `top` of the topmost horizontal line, the page starts without an opening table border. When both conditions hold across a page break, the last row of one page and the first row of the next belong together and can be merged.
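A minimal sketch of that check follows. The function names and the `tol` tolerance are illustrative, not part of pdfplumber. It assumes the table borders show up in `page.lines` (some PDFs draw borders as thin rectangles, which would appear in `page.rects` instead) and that the page contains no header or footer text outside the table; if it does, crop to the table area first.

```python
def _horizontal_tops(lines, tol=1.0):
    # A line is treated as horizontal when its top and bottom coincide (within tol).
    return [ln["top"] for ln in lines if abs(ln["bottom"] - ln["top"]) <= tol]


def ends_without_border(chars, lines, tol=1.0):
    """True if the lowest character sits below the lowest horizontal line."""
    tops = _horizontal_tops(lines, tol)
    if not chars or not tops:
        return False  # assumption: without text or lines, do not guess
    return max(c["top"] for c in chars) > max(tops) + tol


def starts_without_border(chars, lines, tol=1.0):
    """True if the highest character sits above the highest horizontal line."""
    tops = _horizontal_tops(lines, tol)
    if not chars or not tops:
        return False  # assumption: without text or lines, do not guess
    return min(c["top"] for c in chars) < min(tops) - tol


def row_continues(pdf_path, page_index):
    """Hypothetical helper: does the last row of a page continue on the next page?"""
    import pdfplumber  # third-party; imported here so the helpers above stay dependency-free

    with pdfplumber.open(pdf_path) as pdf:
        prev_page = pdf.pages[page_index]
        next_page = pdf.pages[page_index + 1]
        return (ends_without_border(prev_page.chars, prev_page.lines)
                and starts_without_border(next_page.chars, next_page.lines))
```

If `row_continues` returns true for a page break, the last extracted row of the first page and the first extracted row of the following page can be joined cell by cell before writing the CSV.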
Upvotes: -1