Reputation: 15
I used the library. First, I output a table from the pdf on certain pages. Then I go through all the rows in the table and add the row data to the sheet.
import
from openpyxl import Workbook
with pdfplumber.open("Pdffile.pdf") as p:
workbook = Workbook() # New blank Excel workbook
sheet = workbook.active # activation sheet
for i in range(4,6): # Traverse 4 pages-6 page
page = p.pages[i]
for row in table: # Traverse all rows
print(row)
sheet.append(row) # Append write data by row
workbook.save("Excel1.xlsx") # Save file named Excel
print("The first%d page PDF Extraction complete" % i) # Prompt extraction progress
But as a result, I get the last specified page, not all of it. I suppose the error is in the raw traversal. But in what way?
It is in the table that all the pages that are needed are displayed. But in raw, it's the last one.
Upvotes: 0
Views: 1287
Reputation: 1748
Editing the answer in response to the comment by the OP. Improved the table extraction by removing the hidden vertical lines. In order to preserve the structure of the table on pages where the leftmost column is missing, you can handle that by adding missing cells to that row on the left before writing to Excel. I have made the assumption that the desired number of columns is 6 and only the leftmost column will be missing. The updated code will become:
import pdfplumber
from openpyxl import Workbook
def keep_visible_lines(obj):
"""If the object is a hidden line, remove it."""
if obj["object_type"] == "rect":
return obj["non_stroking_color"] == 0
return True
with pdfplumber.open("Pdffile.pdf") as pdf:
workbook = Workbook()
sheet = workbook.active
for page in pdf.pages:
page = page.filter(keep_visible_lines)
table = page.extract_table()
for row in table:
sheet.append(([None] * (6 - len(row))) + row)
workbook.save("Excel1.xlsx")
ORIGINAL
The for loop for writing to the sheet is outside the page for loop. The following code will fix the issue.
import pdfplumber
from openpyxl import Workbook
with pdfplumber.open("Pdffile.pdf") as p:
workbook = Workbook() # New blank Excel workbook
sheet = workbook.active # activation sheet
for i in range(4,6): # Traverse 4 pages-6 page
page = p.pages[i]
table = page.extract_table() # Extract table data
print(table)
for row in table: # Traverse all rows
print(row)
sheet.append(row) # Append write data by row
workbook.save("Excel1.xlsx") # Save file named Excel
print("The first%d page PDF Extraction complete" % i) # Prompt extraction progress
Upvotes: 1