nobody
nobody

Reputation: 15

Convert pdf file to xlsx in python

I used the library. First, I output a table from the pdf on certain pages. Then I go through all the rows in the table and add the row data to the sheet.

    import 
from openpyxl import Workbook

with pdfplumber.open("Pdffile.pdf") as p:
    workbook = Workbook()  # New blank Excel workbook
    sheet = workbook.active  # activation sheet
    for i in range(4,6):  # Traverse 4 pages-6 page
        page = p.pages[i]
    for row in table:  # Traverse all rows
        print(row)
        sheet.append(row)  # Append write data by row
    workbook.save("Excel1.xlsx")  # Save file named Excel
    print("The first%d page PDF Extraction complete" % i)  # Prompt extraction progress

But as a result, I get the last specified page, not all of it. I suppose the error is in the raw traversal. But in what way?

It is in the table that all the pages that are needed are displayed. But in raw, it's the last one.

Upvotes: 0

Views: 1287

Answers (1)

Samkit Jain
Samkit Jain

Reputation: 1748

Editing the answer in response to the comment by the OP. Improved the table extraction by removing the hidden vertical lines. In order to preserve the structure of the table on pages where the leftmost column is missing, you can handle that by adding missing cells to that row on the left before writing to Excel. I have made the assumption that the desired number of columns is 6 and only the leftmost column will be missing. The updated code will become:

import pdfplumber
from openpyxl import Workbook


def keep_visible_lines(obj):
    """If the object is a hidden line, remove it."""
    if obj["object_type"] == "rect":
        return obj["non_stroking_color"] == 0

    return True


with pdfplumber.open("Pdffile.pdf") as pdf:
    workbook = Workbook()
    sheet = workbook.active

    for page in pdf.pages:
        page = page.filter(keep_visible_lines)
        table = page.extract_table()

        for row in table:
            sheet.append(([None] * (6 - len(row))) + row)

        workbook.save("Excel1.xlsx")

ORIGINAL

The for loop for writing to the sheet is outside the page for loop. The following code will fix the issue.

import pdfplumber
from openpyxl import Workbook

with pdfplumber.open("Pdffile.pdf") as p:
    workbook = Workbook()  # New blank Excel workbook
    sheet = workbook.active  # activation sheet
    for i in range(4,6):  # Traverse 4 pages-6 page
        page = p.pages[i]
        table = page.extract_table()  # Extract table data
        print(table)
        for row in table:  # Traverse all rows
            print(row)
            sheet.append(row)  # Append write data by row
        workbook.save("Excel1.xlsx")  # Save file named Excel
        print("The first%d page PDF Extraction complete" % i)  # Prompt extraction progress

Upvotes: 1

Related Questions