Klemz

Reputation: 143

Reading Tables as string from PDF with Tabula

I am using tabula-py 2.0.4 and pandas 1.17.4 on Python 3.7. I am trying to read PDF tables into a DataFrame with tabula.read_pdf:

from tabula import read_pdf
fn = "file.pdf"
print(read_pdf(fn, pages='all', multiple_tables=True)[0])

The problem is that the values are read as floats instead of strings.

I need them to be read as strings, so that a value such as 20.0000 shows that it is accurate to the fourth decimal place. Currently the result is 20.0 instead of 20.0000.

Input data in the PDF (image not included).

Output of the code above (image not included).

Upvotes: 4

Views: 7453

Answers (1)

FredrikHedman

Reputation: 1253

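You need to pass a couple of extra options to tabula.read_pdf. In particular, setting dtype to str in pandas_options keeps every column as text instead of converting it to a numeric type. The example below parses a PDF file twice: first with all columns read as strings, then with the column types guessed: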

import tabula

print(tabula.environment_info())

fname = ("https://github.com/chezou/tabula-py/raw/master/tests/resources/"
         "data.pdf")

# Columns interpreted as str
col2str = {'dtype': str}
kwargs = {'output_format': 'dataframe',
          'pandas_options': col2str,
          'stream': True}
df1 = tabula.read_pdf(fname, **kwargs)

print(df1[0].dtypes)
print(df1[0].head())

# Guessing column type
col2val = {'dtype': None}
kwargs = {'output_format': 'dataframe',
          'pandas_options': col2val,
          'stream': True}
df2 = tabula.read_pdf(fname, **kwargs)

print(df2[0].dtypes)
print(df2[0].head())

With the following output:

Python version:
    3.7.6 (default, Jan  8 2020, 13:42:34) 
[Clang 4.0.1 (tags/RELEASE_401/final)]
Java version:
    openjdk version "13.0.2" 2020-01-14
OpenJDK Runtime Environment (build 13.0.2+8)
OpenJDK 64-Bit Server VM (build 13.0.2+8, mixed mode, sharing)
tabula-py version: 2.0.4
platform: Darwin-19.3.0-x86_64-i386-64bit
uname:
    uname_result(system='Darwin', node='MacBook-Pro-10.local', release='19.3.0', version='Darwin Kernel Version 19.3.0: Thu Jan  9 20:58:23 PST 2020; root:xnu-6153.81.5~1/RELEASE_X86_64', machine='x86_64', processor='i386')
linux_distribution: ('Darwin', '19.3.0', '')
mac_ver: ('10.15.3', ('', '', ''), 'x86_64')

None
'pages' argument isn't specified.Will extract only from page 1 by default.
Unnamed: 0    object
mpg           object
cyl           object
disp          object
hp            object
drat          object
wt            object
qsec          object
vs            object
am            object
gear          object
carb          object
dtype: object
          Unnamed: 0   mpg cyl   disp   hp  drat     wt   qsec vs am gear carb
0          Mazda RX4  21.0   6  160.0  110  3.90  2.620  16.46  0  1    4    4
1      Mazda RX4 Wag  21.0   6  160.0  110  3.90  2.875  17.02  0  1    4    4
2         Datsun 710  22.8   4  108.0   93  3.85  2.320  18.61  1  1    4    1
3     Hornet 4 Drive  21.4   6  258.0  110  3.08  3.215  19.44  1  0    3    1
4  Hornet Sportabout  18.7   8  360.0  175  3.15  3.440  17.02  0  0    3    2
'pages' argument isn't specified.Will extract only from page 1 by default.
Unnamed: 0     object
mpg           float64
cyl             int64
disp          float64
hp              int64
drat          float64
wt            float64
qsec          float64
vs              int64
am              int64
gear            int64
carb            int64
dtype: object
          Unnamed: 0   mpg  cyl   disp   hp  drat     wt   qsec  vs  am  gear  carb
0          Mazda RX4  21.0    6  160.0  110  3.90  2.620  16.46   0   1     4     4
1      Mazda RX4 Wag  21.0    6  160.0  110  3.90  2.875  17.02   0   1     4     4
2         Datsun 710  22.8    4  108.0   93  3.85  2.320  18.61   1   1     4     1
3     Hornet 4 Drive  21.4    6  258.0  110  3.08  3.215  19.44   1   0     3     1
4  Hornet Sportabout  18.7    8  360.0  175  3.15  3.440  17.02   0   0     3     2
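
The effect of dtype can also be checked without a PDF or a Java install. The sketch below uses plain pandas only, on a small made-up CSV string standing in for an extracted table; the column names, the values and the decimal_places helper are hypothetical and not part of tabula-py. It assumes that the dtype option behaves for tabula's output the way it does for pandas.read_csv here, which matches the two runs above but is not verified against tabula-py internals.

```python
import io

import pandas as pd

# Hypothetical CSV text standing in for a table extracted from a PDF.
raw = "name,value\na,20.0000\nb,3.50\n"

# Default behaviour: pandas infers float64, so trailing zeros are lost.
inferred = pd.read_csv(io.StringIO(raw))

# dtype=str keeps every cell exactly as it was written in the source.
as_text = pd.read_csv(io.StringIO(raw), dtype=str)


def decimal_places(text):
    """Count the digits after the decimal point in a numeric string."""
    _, _, frac = text.partition(".")
    return len(frac)


print(inferred["value"].dtype)
print(as_text["value"].tolist())
print(as_text["value"].map(decimal_places).tolist())
```

Once the values are strings, the stated precision can be read from the text itself and the column can still be converted to numbers later with pd.to_numeric if arithmetic is needed.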


Upvotes: 5
