Reputation: 11
I want to extract semi-structured tables from PDF files. I would consider modules other than pdfplumber if they work better. I need more than the table grid itself: sometimes text above the table is still part of it (for example, the column names sometimes sit above the table), and sometimes the table continues on the next page.
I tried extract_text_lines() and it works fine. I want to go through the PDF line by line and, whenever a line belongs to a table, start collecting its data.
import pdfplumber

def extract_table_from_page(pdf_path, page_number):
    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_number]  # zero-based page index
        lines = page.extract_text_lines()
        for line in lines:
            # each line is a dict; 'chars' holds its character objects
            if 'chars' in line:
                print(line)
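One possible way to decide whether a line "is a table" is to count wide horizontal gaps between its characters. The sketch below is only a heuristic, not part of pdfplumber: the helper names `count_columns` and `looks_like_table_row` and the thresholds `min_gap` and `min_columns` are hypothetical and would need tuning per document. It assumes each line is a dict whose 'chars' entries carry 'text', 'x0' and 'x1', as in the extract_text_lines() output printed above.

```python
def count_columns(line, min_gap=10.0):
    """Count character groups separated by horizontal gaps wider than min_gap."""
    # Ignore whitespace characters and order the rest from left to right.
    chars = sorted(
        (c for c in line.get("chars", []) if c["text"].strip()),
        key=lambda c: c["x0"],
    )
    if not chars:
        return 0
    columns = 1
    for prev, cur in zip(chars, chars[1:]):
        # A wide gap between neighbouring characters suggests a new column.
        if cur["x0"] - prev["x1"] > min_gap:
            columns += 1
    return columns


def looks_like_table_row(line, min_columns=3, min_gap=10.0):
    """Guess that a line is a table row if it splits into enough columns."""
    return count_columns(line, min_gap) >= min_columns
```

Inside the loop above, `if looks_like_table_row(line):` could replace the 'chars' check. Ordinary paragraphs with justified or widely spaced text may trigger false positives, so this is a starting point at best.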
Upvotes: 1
Views: 1158
Reputation: 3120
Here is a PyMuPDF example for a table with external column headers, that is, headers located outside the table itself. The header text appears at several different rotation angles, and some headers span multiple lines.
Some of the column names are vertical.
The following PyMuPDF script finds and extracts the table, identifies the column names, and prints the table contents in GitHub-compatible Markdown:
import fitz  # PyMuPDF
doc = fitz.open("input.pdf")  # test file
page = doc[0]  # first page, which contains the table
tabs = page.find_tables()  # find tables on page
tab = tabs[0]  # take first table
print(tab.to_markdown())  # print all content in GitHub Markdown format
|Column1|column2|column3 line 2|column4 line 2|
|---|---|---|---|
|11|22|33|44|
|55|66|77|88|
|99|AA|BB|CC|
|DD|EE|FF||
tab.header.external # show some table header properties
True
tab.header.names
['Column1', 'column2', 'column3 line 2', 'column4 line 2']
BTW: Other output formats are available too, such as a Python list of lists or a pandas DataFrame.
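As a rough illustration of those other formats: `tab.extract()` returns the rows as a list of lists, and `tab.to_pandas()` builds a DataFrame. The helper below, `table_to_records`, is a hypothetical name and not part of PyMuPDF; it only relies on `extract()`, `header.names` and `header.external`, and it assumes that an internal (non-external) header shows up as the first extracted row.

```python
def table_to_records(tab):
    """Turn a PyMuPDF table into a list of {column name: cell text} dicts."""
    rows = tab.extract()  # list of rows, each a list of cell strings
    names = list(tab.header.names)
    if not tab.header.external:
        # Assumption: an internal header is the first extracted row, so skip it.
        rows = rows[1:]
    return [dict(zip(names, row)) for row in rows]


# Usage with the script above (requires PyMuPDF and a real PDF):
#   records = table_to_records(tab)
#   df = tab.to_pandas()  # built-in DataFrame output; needs pandas installed
```

Plain dicts keep the example free of extra dependencies; for anything beyond a quick inspection, `tab.to_pandas()` is probably the more convenient route.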
Note: I am a maintainer and the original creator of PyMuPDF.
Upvotes: 1