Reputation: 143
I am using tabula-py 2.0.4 and pandas 1.17.4 on Python 3.7. I am trying to read PDF tables into a DataFrame with tabula.read_pdf:
from tabula import read_pdf
fn = "file.pdf"
print(read_pdf(fn, pages='all', multiple_tables=True)[0])
The problem is that the values are read as float instead of string.
I need it to be read as string, so if the value is 20.0000, I know that accuracy is to the fourth decimal. Now it returns 20.0 instead of 20.0000.
The output with above code is
Upvotes: 4
Views: 7453
Reputation: 1253
You need to add a couple of options to tabula.read_pdf. Here is an example that parses a PDF file and interprets the columns it finds in two different ways:
import tabula
print(tabula.environment_info())
fname = ("https://github.com/chezou/tabula-py/raw/master/tests/resources/"
         "data.pdf")
# Columns interpreted as str
col2str = {'dtype': str}
kwargs = {'output_format': 'dataframe',
          'pandas_options': col2str,
          'stream': True}
df1 = tabula.read_pdf(fname, **kwargs)
print(df1[0].dtypes)
print(df1[0].head())
# Guessing column type
col2val = {'dtype': None}
kwargs = {'output_format': 'dataframe',
          'pandas_options': col2val,
          'stream': True}
df2 = tabula.read_pdf(fname, **kwargs)
print(df2[0].dtypes)
print(df2[0].head())
With the following output:
Python version:
3.7.6 (default, Jan 8 2020, 13:42:34)
[Clang 4.0.1 (tags/RELEASE_401/final)]
Java version:
openjdk version "13.0.2" 2020-01-14
OpenJDK Runtime Environment (build 13.0.2+8)
OpenJDK 64-Bit Server VM (build 13.0.2+8, mixed mode, sharing)
tabula-py version: 2.0.4
platform: Darwin-19.3.0-x86_64-i386-64bit
uname:
uname_result(system='Darwin', node='MacBook-Pro-10.local', release='19.3.0', version='Darwin Kernel Version 19.3.0: Thu Jan 9 20:58:23 PST 2020; root:xnu-6153.81.5~1/RELEASE_X86_64', machine='x86_64', processor='i386')
linux_distribution: ('Darwin', '19.3.0', '')
mac_ver: ('10.15.3', ('', '', ''), 'x86_64')
None
'pages' argument isn't specified.Will extract only from page 1 by default.
Unnamed: 0    object
mpg           object
cyl           object
disp          object
hp            object
drat          object
wt            object
qsec          object
vs            object
am            object
gear          object
carb          object
dtype: object
          Unnamed: 0   mpg  cyl   disp   hp  drat     wt   qsec  vs  am  gear  carb
0          Mazda RX4  21.0    6  160.0  110  3.90  2.620  16.46   0   1     4     4
1      Mazda RX4 Wag  21.0    6  160.0  110  3.90  2.875  17.02   0   1     4     4
2         Datsun 710  22.8    4  108.0   93  3.85  2.320  18.61   1   1     4     1
3     Hornet 4 Drive  21.4    6  258.0  110  3.08  3.215  19.44   1   0     3     1
4  Hornet Sportabout  18.7    8  360.0  175  3.15  3.440  17.02   0   0     3     2
'pages' argument isn't specified.Will extract only from page 1 by default.
Unnamed: 0     object
mpg           float64
cyl             int64
disp          float64
hp              int64
drat          float64
wt            float64
qsec          float64
vs              int64
am              int64
gear            int64
carb            int64
dtype: object
          Unnamed: 0   mpg  cyl   disp   hp  drat     wt   qsec  vs  am  gear  carb
0          Mazda RX4  21.0    6  160.0  110  3.90  2.620  16.46   0   1     4     4
1      Mazda RX4 Wag  21.0    6  160.0  110  3.90  2.875  17.02   0   1     4     4
2         Datsun 710  22.8    4  108.0   93  3.85  2.320  18.61   1   1     4     1
3     Hornet 4 Drive  21.4    6  258.0  110  3.08  3.215  19.44   1   0     3     1
4  Hornet Sportabout  18.7    8  360.0  175  3.15  3.440  17.02   0   0     3     2
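Applied to the call in the question, the same idea would probably look like the sketch below. It combines the question's arguments (pages='all', multiple_tables=True) with the pandas_options from this answer, and adds a small helper that reads the number of decimal places back from a string value. The names READ_KWARGS and decimal_places, and the column name "value" in the commented usage, are made up for illustration; the helper assumes every cell it receives is a plain numeric string (empty cells would need to be filtered out first).

```python
from decimal import Decimal

# Arguments from the question plus the dtype option from this answer.
# pandas_options={'dtype': str} asks pandas to keep every cell as text.
READ_KWARGS = {
    "pages": "all",
    "multiple_tables": True,
    "pandas_options": {"dtype": str},
}


def decimal_places(text):
    """Return how many digits follow the decimal point in a numeric string.

    Hypothetical helper: assumes `text` is a plain number such as "20.0000".
    """
    exponent = Decimal(text).as_tuple().exponent
    return max(0, -exponent)


# Usage with tabula (needs Java and a real PDF, so it is left commented out;
# "value" is a placeholder column name):
# import tabula
# tables = tabula.read_pdf("file.pdf", **READ_KWARGS)
# precision = tables[0]["value"].map(decimal_places)

print(decimal_places("20.0000"))  # 4
print(decimal_places("20.0"))     # 1
print(decimal_places("6"))        # 0
```

Decimal is used here instead of float because it keeps trailing zeros, so "20.0000" and "20.0" stay distinguishable.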
Upvotes: 5