How can I extract semi-structured tables from a PDF using pdfplumber?

Question

I want to extract semi-structured tables from PDF files. I'm open to modules other than pdfplumber if they work better. I need more than the table itself: sometimes text above the table is still part of it (for example, column names sometimes sit above the table), and sometimes a table continues on the next page.

I tried extract_text_lines() and it works fine. I want to go through the PDF line by line and, whenever a line belongs to a table, start collecting its data.

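import pdfplumber

def extract_table_from_page(pdf_path, page_number):
    # page_number is a zero-based index into pdf.pages
    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_number]
        # Each line is a dict with 'text', 'x0', 'top', 'x1', 'bottom' and 'chars'
        lines = page.extract_text_lines()
        for line in lines:
            if 'chars' in line:
                print(line)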
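
A minimal sketch of the line-by-line idea described above, assuming pdfplumber's page.find_tables() detects the table grid on the page. Each text line is compared against the detected table bounding boxes and tagged as inside a table, directly above one (a possible column-name line), or plain text. The names label_lines, label_page_lines and header_gap are hypothetical, and the 15-point default is an illustrative guess that would need tuning per document.

```python
def label_lines(lines, table_bboxes, header_gap=15.0):
    """Tag each text line as 'table', 'header' or 'text'.

    lines: dicts with 'top' and 'bottom' keys, like those returned by
        page.extract_text_lines().
    table_bboxes: (x0, top, x1, bottom) tuples, like Table.bbox from
        page.find_tables().
    header_gap: hypothetical threshold in PDF points; a line ending at most
        this far above a table is treated as a possible header.
    """
    labeled = []
    for line in lines:
        mid = (line["top"] + line["bottom"]) / 2
        label = "text"
        for _x0, top, _x1, bottom in table_bboxes:
            if top <= mid <= bottom:
                # Vertical midpoint falls inside the table's bounding box
                label = "table"
                break
            if 0 <= top - line["bottom"] <= header_gap:
                # Line sits just above the table's top edge
                label = "header"
        labeled.append((label, line))
    return labeled


def label_page_lines(pdf_path, page_number, header_gap=15.0):
    # Imported here so label_lines() can be used without pdfplumber installed
    import pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_number]
        bboxes = [table.bbox for table in page.find_tables()]
        return label_lines(page.extract_text_lines(), bboxes, header_gap)
```

Calling label_page_lines(pdf_path, 0) returns (label, line) pairs for the first page. The comparison is vertical only, so it does not separate side-by-side tables, and tables that continue on the next page would still need to be joined by the caller, for example by checking whether a table's bbox reaches the bottom of one page and another starts at the top of the next.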

Answers (1)