jsanjayce

Reputation: 352

pdfplumber - Extract a table row split across multiple pages

Given a PDF (attached) in which a table row is split across multiple pages by a page break, I am trying to extract the tabular data to a CSV using pdfplumber. The split row comes out as separate rows in the CSV, but I would like to get it as a single row.

With pdfplumber, is there a way to identify if the row has a horizontal border or not? If this information is available, it could help in merging the rows.

In the attached image, the cell contents are highlighted in grey.


Upvotes: 0

Views: 1145

Answers (1)

Samkit Jain

Reputation: 1748

Every pdfplumber object has a top value (the distance from the top of the object to the top of the page), so a larger top means the object sits lower on the page. You can use it to check whether one page ends without a table border and the next page starts without one.

If the top value of the bottommost character is greater than the top value of the bottommost horizontal line, the page ends without a closing table border. Similarly, if the top value of the topmost character is smaller than the top value of the topmost horizontal line, the page starts without an opening table border. If both conditions hold for two consecutive pages, the last row of the first page and the first row of the second page are likely parts of the same row and can be merged.
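The comparison above could be sketched roughly as follows. This is only an illustration under some assumptions: the helper names (`is_horizontal`, `ends_without_border`, `starts_without_border`, `row_continues`) and the `tolerance` parameter are made up for this example, the table borders are assumed to show up in `page.lines` (borders drawn as rectangles would need `page.rects` or `page.edges` instead), and the pages are assumed to contain no characters outside the table, such as page numbers or running headers. If they do, crop the page to the table area first.

```python
def is_horizontal(line, tolerance=1.0):
    """Treat a line as horizontal when its top and bottom nearly coincide."""
    return abs(line["bottom"] - line["top"]) <= tolerance


def ends_without_border(chars, lines, tolerance=1.0):
    """True if the lowest character sits below the lowest horizontal line."""
    h_tops = [ln["top"] for ln in lines if is_horizontal(ln, tolerance)]
    if not chars or not h_tops:
        return False  # nothing to compare, so do not claim a split row
    return max(c["top"] for c in chars) > max(h_tops) + tolerance


def starts_without_border(chars, lines, tolerance=1.0):
    """True if the highest character sits above the highest horizontal line."""
    h_tops = [ln["top"] for ln in lines if is_horizontal(ln, tolerance)]
    if not chars or not h_tops:
        return False  # nothing to compare, so do not claim a split row
    return min(c["top"] for c in chars) < min(h_tops) - tolerance


def row_continues(prev_page, next_page, tolerance=1.0):
    """Guess whether the last row of prev_page continues on next_page.

    Both arguments only need `.chars` and `.lines` attributes holding
    lists of dicts with "top" (and, for lines, "bottom") keys, which is
    the shape pdfplumber pages expose.
    """
    return (
        ends_without_border(prev_page.chars, prev_page.lines, tolerance)
        and starts_without_border(next_page.chars, next_page.lines, tolerance)
    )
```

With pdfplumber installed, you would open the file with `pdfplumber.open(...)`, call `row_continues(pdf.pages[i], pdf.pages[i + 1])` for each pair of consecutive pages, and, when it returns True, join the last extracted row of page `i` with the first extracted row of page `i + 1` cell by cell before writing the CSV.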

Upvotes: -1
