ccsv

Reputation: 8659

Opening a PDF and reading in tables with Python pandas

Is it possible to open a PDF and read its tables in using Python pandas, or do I have to use the pandas clipboard function for this?

Upvotes: 44

Views: 158673

Answers (7)

zzhapar

Reputation: 135

I use the tabula-py library, which you can install via:

pip install tabula-py

To read several tables from a PDF given by a link, for example:

import tabula
# url is the path or URL of the PDF to read
df = tabula.io.read_pdf(url, pages='all')

You then get a list of tables, and you can access each one by index, just like an element of a list. For example:

# first table in the list
df[0]

More info here: https://pypi.org/project/tabula-py/
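As a rough sketch of what can be done with that list, the following stacks every table into a single DataFrame. The helper name read_all_tables is hypothetical, and the sketch assumes a recent tabula-py (with a Java runtime) and pandas are installed, and that all tables in the PDF share the same columns.

```python
def read_all_tables(pdf_path):
    """Read every table in a PDF and stack them into one DataFrame."""
    # Imported lazily so this module still loads where tabula-py is absent.
    import pandas as pd
    import tabula

    # With pages="all", recent tabula-py versions return a list of DataFrames.
    tables = tabula.read_pdf(pdf_path, pages="all")
    if not tables:
        return pd.DataFrame()
    # Stacking only makes sense when the tables have matching columns.
    return pd.concat(tables, ignore_index=True)
```

If the tables have different layouts, it is probably safer to keep the list and handle each DataFrame separately.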

Upvotes: 1

Mark

Reputation: 984

There is a new version of tabula called tabula-py:

pip install tabula-py

The .read_pdf method works just like in the old version. The documentation is here: https://pypi.org/project/tabula-py/

Upvotes: 19

joselquin

Reputation: 173

I have been doing some tests with Camelot (https://camelot-py.readthedocs.io/en/master/), and it works very well in many situations. You can also adjust some parameters if the default ones don't work.

It's similar to Tabula, but it uses different algorithms (Tabula uses the vector data in the PDF and rasterizes the lines of the table; Camelot uses the Hough transform), so you can try both to find the better one.

Both have a web version, so you can try them on a few examples to decide which is best for your application.
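For anyone who wants to try Camelot from Python, here is a minimal sketch. The helper name extract_tables is made up for illustration, and it assumes camelot-py and its dependencies are installed. The flavor argument is one of the parameters worth adjusting when the default does not find the table.

```python
def extract_tables(pdf_path, flavor="lattice"):
    """Return the tables Camelot finds on page 1 as pandas DataFrames."""
    # Imported lazily so this module still loads where camelot-py is absent.
    import camelot

    # "lattice" relies on ruling lines between cells;
    # "stream" relies on the whitespace between columns instead.
    tables = camelot.read_pdf(pdf_path, pages="1", flavor=flavor)
    for table in tables:
        # The parsing report includes accuracy and whitespace scores.
        print(table.parsing_report)
    return [table.df for table in tables]
```

If "lattice" returns nothing for a table without visible borders, calling the same function with flavor="stream" is a reasonable next attempt.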

Upvotes: 9

Isac Junior

Reputation: 346

You can use tabula: https://blog.chezo.uno/tabula-py-extract-table-from-pdf-into-python-dataframe-6c7acfa5f302

from tabula import read_pdf
df = read_pdf('data.pdf')

You can see more in the link!

Upvotes: 30

JMM

Reputation: 87

Copy the table data from the PDF and paste it into an Excel file (it usually gets pasted as a single column rather than multiple columns). Then use Flash Fill (available in Excel 2016; I'm not sure about earlier Excel versions) to separate the data into the columns originally shown in the PDF. The process is fast and easy. Then use pandas to wrangle the Excel data.

Upvotes: 3

Matija Han

Reputation: 482

In case it is a one-off, you can copy the data from your PDF table into a text file, format it (using search-and-replace, Notepad++ macros, a script), save it as a CSV file and load it into Pandas.
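The "script" option above could look roughly like the sketch below. It assumes the pasted text separates columns with tabs or with runs of two or more spaces, which will not hold for every PDF. The function name and the sample rows are made up for illustration.

```python
import csv
import io
import re


def pasted_text_to_csv(text):
    """Convert pasted table text to CSV, splitting on tabs or 2+ spaces."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # A single space is kept, so cells like "Green tea" stay together.
        writer.writerow(re.split(r"\t+|\s{2,}", line))
    return out.getvalue()


# Made-up sample of text as it might look after pasting from a PDF.
sample = "Name        Qty   Price\nGreen tea   2     3.50\nCoffee      10    4.25\n"
csv_text = pasted_text_to_csv(sample)
```

The resulting text can be saved to a file, or passed to pandas.read_csv wrapped in io.StringIO, to get a DataFrame.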

If you need to do this in a scalable way, you might try this product: http://tabula.technology/. I have not used it yet, so I don't know how well it works, but you can explore it if you need it.

Upvotes: 8

Daniel

Reputation: 42748

This is not possible. PDF is a data format for printing, so the table structure is lost. With some luck, you can extract the text with pypdf and guess the former table columns.
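A minimal sketch of that "extract and guess" idea follows. It assumes pypdf is installed and uses a deliberately naive heuristic: only the first cell of a row may contain spaces, and the number of columns is known in advance. Both helper names are hypothetical.

```python
def extract_lines(pdf_path):
    """Return the non-empty text lines of every page in the PDF."""
    # Imported lazily so this module still loads where pypdf is absent.
    from pypdf import PdfReader

    lines = []
    for page in PdfReader(pdf_path).pages:
        text = page.extract_text() or ""
        lines.extend(s.strip() for s in text.splitlines() if s.strip())
    return lines


def guess_row(line, n_columns):
    """Guess the cells of one row, assuming only the first cell has spaces."""
    # Split from the right so the leftover text becomes the first cell.
    cells = line.rsplit(maxsplit=n_columns - 1)
    # Lines with too few parts are probably not table rows.
    return cells if len(cells) == n_columns else None
```

Rows for which guess_row returns None can be dropped or inspected by hand; the kept rows can then be passed to pandas.DataFrame.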

Upvotes: 4
