Alex

Reputation: 11

How can I extract semi-structured tables from a PDF using pdfplumber?

I want to extract semi-structured tables from PDF files. I would consider modules other than pdfplumber if they work better. I need more than the table itself: sometimes text above the table is still part of it (for example, the column names sometimes sit above the table), and sometimes the table continues on the next page.

I tried using extract_text_lines() and it works fine. I want to go through the PDF line by line and, if a line belongs to a table, start collecting its data.

import pdfplumber

def extract_table_from_page(pdf_path, page_number):
    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_number]  # page_number is 0-based
        # each line is a dict holding its text, position and characters
        lines = page.extract_text_lines()
        for line in lines:
            if 'chars' in line:
                print(line)

Upvotes: 1

Views: 1158

Answers (1)

Jorj McKie

Reputation: 3120

Here is a PyMuPDF example for a table whose column headers are external to the table and appear at several text rotation angles, including multi-line column headers.

Some of the column names are vertical.

Here is a PyMuPDF script which finds and extracts the table, identifies the column names and prints the table contents in markdown format (GitHub-compatible):

import fitz  # PyMuPDF
doc = fitz.open("input.pdf")  # test file
page = doc[0]  # first page having the table
tabs = page.find_tables()  # find tables on page
tab = tabs[0]  # take first table
print(tab.to_markdown())  # print all content in GitHub markdown format

|Column1|column2|column3 line 2|column4 line 2|
|---|---|---|---|
|11|22|33|44|
|55|66|77|88|
|99|AA|BB|CC|
|DD|EE|FF||


tab.header.external  # show some table header properties
True

tab.header.names
['Column1', 'column2', 'column3 line 2', 'column4 line 2']
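
The header properties shown above can be combined with the extracted rows. The following is only a sketch: `table_to_records` and `records_from_page` are hypothetical helper names, and the code assumes that an internal (non-external) header shows up as the first row returned by `tab.extract()`, so it is dropped in that case.

```python
def table_to_records(names, rows, external):
    """Pair column names with data rows.

    names:    column names, e.g. tab.header.names
    rows:     list of lists, e.g. tab.extract()
    external: True if the header lies outside the table grid
    """
    # Assumption: an internal header is repeated as the first extracted row.
    data = rows if external else rows[1:]
    return [dict(zip(names, row)) for row in data]


def records_from_page(pdf_path, page_number=0):
    """Hypothetical wrapper: collect records for every table on one page."""
    import fitz  # PyMuPDF, imported here so the helper above has no dependencies

    doc = fitz.open(pdf_path)
    page = doc[page_number]
    records = []
    for tab in page.find_tables().tables:
        records.extend(
            table_to_records(tab.header.names, tab.extract(), tab.header.external)
        )
    doc.close()
    return records
```

Calling `records_from_page("input.pdf")` would then give one dict per table row, keyed by column name.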

BTW: Other formats are available too, like a Python list of lists or output to pandas DataFrame.
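
For a table that continues on the next page, as mentioned in the question, the list-of-lists format could be stitched together page by page. This is a rough heuristic sketch, not a built-in PyMuPDF feature: `stitch_rows` and `extract_continued_table` are hypothetical names, and it assumes the continuation is the first table on each following page and has the same number of columns.

```python
def stitch_rows(names, parts):
    """Concatenate row lists from several pages into one list of rows.

    names: column names taken from the first part of the table
    parts: one list of rows (list of lists) per page
    """
    stitched = []
    for rows in parts:
        for row in rows:
            if list(row) == list(names):
                continue  # skip a header row repeated on a page
            stitched.append(list(row))
    return stitched


def extract_continued_table(pdf_path, first_page, last_page):
    """Hypothetical wrapper: follow one table across a range of pages."""
    import fitz  # PyMuPDF, imported here so stitch_rows has no dependencies

    doc = fitz.open(pdf_path)
    names, parts = None, []
    for pno in range(first_page, last_page + 1):
        tables = doc[pno].find_tables().tables
        if not tables:
            continue  # no table detected on this page
        tab = tables[0]  # assumption: the continuation is the first table
        if names is None:
            names = tab.header.names
        elif tab.col_count != len(names):
            break  # different shape: assume the table has ended
        parts.append(tab.extract())
    doc.close()
    return names, stitch_rows(names or [], parts)
```

The column-count check is deliberately simple; real documents may need a comparison of column positions as well.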

Note: I am a maintainer and the original creator of PyMuPDF.

Upvotes: 1
