Reputation: 11
This is an example image from my PDF file, which has 75 pages.
Upvotes: 1
Views: 7168
Reputation: 1481
Camelot is a great option for extracting borderless tables. You can use the flavor='stream' option for extraction.
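import camelot

tables = camelot.read_pdf('sample.pdf', flavor='stream', edge_tol=500, pages='1-end')
# Tables from all your pages will be stored in the tables object
df = tables[0].df  # the first table as a pandas DataFrame
df.to_csv('sample.csv')  # example output name; to_csv() without a path only returns a string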
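To write every detected table to its own CSV rather than only the first, a loop along these lines may help. This is a minimal sketch: `export_tables` and the `table_<n>.csv` naming are hypothetical, and it assumes only that each item exposes a pandas DataFrame as `.df`, as the tables returned by `camelot.read_pdf` do.

```python
from pathlib import Path


def export_tables(tables, out_dir="."):
    """Write each table's DataFrame to out_dir/table_<n>.csv and return the paths.

    `tables` is assumed to be an iterable of objects with a `.df` attribute
    holding a pandas DataFrame (for example the result of camelot.read_pdf).
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for n, table in enumerate(tables):
        path = out / f"table_{n}.csv"  # hypothetical naming scheme
        table.df.to_csv(path, index=False)
        paths.append(path)
    return paths
```

With the snippet above, `export_tables(tables, "csv_out")` would produce one file per table found across the 75 pages.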
Upvotes: 0
Reputation: 3961
You can do this with Python and the tabula module (the tabula-py package). Since the table is borderless, you can first find the area dynamically with my get_area function (adjust the page number etc. as needed):
from tabula import convert_into, convert_into_by_batch, read_pdf
from tabulate import tabulate
def get_area(file):
    """Set and return the area from which to extract data from within a PDF page
    by reading the file as JSON, extracting the locations
    and expanding these.
    """
    tables = read_pdf(file, output_format="json", pages=2, silent=True)
    top = tables[0]["top"]
    left = tables[0]["left"]
    bottom = tables[0]["height"] + top
    right = tables[0]["width"] + left
    # print(f"{top=}\n{left=}\n{bottom=}\n{right=}")
    return [top - 20, left - 20, bottom + 10, right + 10]
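The area arithmetic can be checked in isolation, without opening a PDF. The sketch below is not part of tabula: `padded_area` is a hypothetical helper that takes one entry of the JSON output (a dict with `top`, `left`, `width` and `height`, as used in get_area) and applies the same padding. Clamping the top and left edges at zero is an added assumption, for tables that sit close to the page edge.

```python
def padded_area(table, pad_before=20, pad_after=10):
    """Return [top, left, bottom, right] for one tabula JSON table entry.

    `table` is assumed to be a dict with "top", "left", "width" and "height"
    keys. The top/left edges are moved out by `pad_before` and the
    bottom/right edges by `pad_after`.
    """
    top = table["top"]
    left = table["left"]
    bottom = top + table["height"]
    right = left + table["width"]
    return [
        max(top - pad_before, 0),   # clamp: an assumption, not in get_area
        max(left - pad_before, 0),
        bottom + pad_after,
        right + pad_after,
    ]


example = {"top": 100.0, "left": 50.0, "width": 400.0, "height": 300.0}
print(padded_area(example))  # [80.0, 30.0, 410.0, 460.0]
```

The returned list follows the top, left, bottom, right order that the `area` argument of `read_pdf` expects.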
Before conversion, check that the format of your first table looks correct:
def inspect_1st_table(file: str):
    df = read_pdf(
        file,
        # output_format="dataframe",
        multiple_tables=True,
        pages="all",
        area=get_area(file),
        silent=True,  # Suppress all stderr output
    )[0]
    print(tabulate(df.head()))
Then, use the area to do your table extraction, from pdf to csv:
def convert_pdf_to_csv(file: str):
    """Output all the tables in the PDF to a CSV"""
    convert_into(
        file,
        file[:-3] + "csv",
        output_format="csv",
        pages="all",
        area=get_area(file),
        silent=True,
    )
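Note that `file[:-3] + "csv"` relies on the input name ending in a three-letter extension. If that is not guaranteed, the output name could be derived with pathlib instead. A small sketch, where `csv_path_for` is a hypothetical helper name:

```python
from pathlib import Path


def csv_path_for(pdf_path):
    """Return the input path with its final extension replaced by .csv."""
    return str(Path(pdf_path).with_suffix(".csv"))


print(csv_path_for("report.pdf"))  # report.csv
```

The result could then be passed as the second argument of `convert_into` in place of the slice.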
In case you need to extract more than 1 table, again start by inspecting them:
def show_tables(file: str):
    """Read pdf into list of DataFrames"""
    tables = read_pdf(
        file, pages="all", multiple_tables=True, area=get_area(file), silent=True
    )
    for df in tables:
        print(tabulate(df))
And to do a batch conversion of all PDF tables to CSV format:
def convert_batch(directory: str):
    """Convert all PDFs in a directory"""
    convert_into_by_batch(directory, output_format="csv", pages="all", silent=True)
Upvotes: 1