ericlighthofmann

Reputation: 21

Split a PDF file into two columns along a certain measurement in Python?

I have a ton of PDF files that are laid out in two columns. When I use PyPDF2 to extract the text, it returns the entire first column (which acts as headers) followed by the entire second column, so I can't split the text on the headers. The layout looks like this:

____ __________
|Col1 Col2 |
|Col1 Col2 |
|Col1 Col2 |
|Col1 Col2 |
____ __________

I think I need to split each page along the edge of the first column, then read the rows left to right. The first column is 2.26 inches wide on an 8x11 page. I can also get the coordinates using PyPDF2.
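For reference, PDF coordinates are normally measured in points (72 per inch), so the split position can be worked out ahead of time. This is only a rough sketch of the arithmetic: the helper names `inches_to_points` and `column_boxes` are made up for illustration, and it assumes the page origin is at the bottom-left corner with no rotation.

```python
POINTS_PER_INCH = 72  # default PDF user-space unit is 1/72 inch


def inches_to_points(inches):
    """Convert a measurement in inches to PDF points."""
    return inches * POINTS_PER_INCH


def column_boxes(page_width_in, page_height_in, split_in):
    """Return (left, right) boxes as (x0, y0, x1, y1) tuples in points.

    Hypothetical helper: assumes the origin is the bottom-left corner
    and that the split is a single vertical line at split_in inches.
    """
    split = inches_to_points(split_in)
    width = inches_to_points(page_width_in)
    height = inches_to_points(page_height_in)
    left = (0, 0, split, height)
    right = (split, 0, width, height)
    return left, right


left_box, right_box = column_boxes(8, 11, 2.26)
print(left_box)
print(right_box)
```

With the numbers above, the split falls at roughly 162.72 points from the left edge of a 576 x 792 point page.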

Does anyone have any experience doing this or know how I would do it?

Edit: When I call extractText in PyPDF2, the output has no spaces: Col1Col1Col1Col1Col2Col2Col2Col2

Upvotes: 1

Views: 1384

Answers (1)

ericlighthofmann

Reputation: 21

Using pdfminer.six instead of PyPDF2 worked: it read each row from left to right, with spaces between the columns.
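A minimal sketch of what that can look like, assuming pdfminer.six is installed (`pip install pdfminer.six`). `extract_text` comes from `pdfminer.high_level`; the wrapper `extract_pdf_text`, the helper `split_rows`, and the file name `example.pdf` are placeholders for illustration. The helper also assumes each row comes out on its own line and that the header is a single word, which may not hold for every file.

```python
def extract_pdf_text(pdf_path):
    """Return the text of a PDF as one string using pdfminer.six."""
    # Imported here so the rest of the module still loads without pdfminer.six.
    from pdfminer.high_level import extract_text
    return extract_text(pdf_path)


def split_rows(text):
    """Split extracted text into (header, value) pairs, one per non-empty line.

    Assumption: the header is the first whitespace-delimited token on the
    line and everything after it belongs to the second column.
    """
    rows = []
    for line in text.splitlines():
        parts = line.strip().split(None, 1)
        if not parts:
            continue  # skip blank lines
        header = parts[0]
        value = parts[1] if len(parts) > 1 else ""
        rows.append((header, value))
    return rows


if __name__ == "__main__":
    # "example.pdf" is a placeholder path.
    for header, value in split_rows(extract_pdf_text("example.pdf")):
        print(header, "->", value)
```

If the headers span several words, splitting on the first space will not be enough, and the word coordinates would be needed to tell the columns apart.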

Upvotes: 1
