Python tabula-py error (pandas error?)

Question

After some reading online I have decided to use tabula-py to extract tables from pdf files. We use Anaconda and I just installed tabula-py 1.1.1.

I wanted to start out with a simple script and see what it would do with a single page pdf file with some text and two tables ("table_p16.pdf").

The code:

from tabula import read_pdf
df = read_pdf("table_p16.pdf")

The error:

Picked up JAVA_TOOL_OPTIONS: -Djava.security.properties=c:\Windows\Sun\Java\Deployment\sam.security

Traceback (most recent call last):
  File "H:/Personlich/SVN/blademat_tb/blademat_toolbox/utility/read_pdf.py", line 41, in <module>
    df = read_pdf("table_p16.pdf")
  File "C:\Users\xxxxxxxxxxxx\AppData\Local\Continuum\Anaconda3\envs\test_env\lib\site-packages\tabula\wrapper.py", line 117, in read_pdf
    return pd.read_csv(io.BytesIO(output), **pandas_options)
  File "C:\Users\xxxxxxxxxxxx\AppData\Local\Continuum\Anaconda3\envs\test_env\lib\site-packages\pandas\io\parsers.py", line 709, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "C:\Users\xxxxxxxxxxxx\AppData\Local\Continuum\Anaconda3\envs\test_env\lib\site-packages\pandas\io\parsers.py", line 455, in _read
    data = parser.read(nrows)
  File "C:\Users\xxxxxxxxxxxx\AppData\Local\Continuum\Anaconda3\envs\test_env\lib\site-packages\pandas\io\parsers.py", line 1069, in read
    ret = self._engine.read(nrows)
  File "C:\Users\xxxxxxxxxxxx\AppData\Local\Continuum\Anaconda3\envs\test_env\lib\site-packages\pandas\io\parsers.py", line 1839, in read
    data = self._reader.read(nrows)
  File "pandas/_libs/parsers.pyx", line 902, in pandas._libs.parsers.TextReader.read
  File "pandas/_libs/parsers.pyx", line 924, in pandas._libs.parsers.TextReader._read_low_memory
  File "pandas/_libs/parsers.pyx", line 978, in pandas._libs.parsers.TextReader._read_rows
  File "pandas/_libs/parsers.pyx", line 965, in pandas._libs.parsers.TextReader._tokenize_rows
  File "pandas/_libs/parsers.pyx", line 2208, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Expected 8 fields in line 9, saw 9
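
As far as I can tell, the message means that the CSV text handed to pandas contains a row (line 9) with 9 fields where the earlier rows have 8. A minimal sketch of how such a row could be located, using only the standard library; the function name find_ragged_rows and the sample text are made up for illustration and are not real tabula output:

```python
import csv
import io


def find_ragged_rows(csv_text):
    """Return (row_number, field_count) for rows whose width differs from the first row."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    expected = len(rows[0]) if rows else 0
    # Row numbers are 1-based; they match line numbers only if no field contains a newline.
    return [(i, len(row)) for i, row in enumerate(rows, start=1) if len(row) != expected]


# Made-up sample, not real tabula output: the third row has one field too many.
sample = "a,b,c\n1,2,3\n4,5,6,7\n"
print(find_ragged_rows(sample))  # prints [(3, 4)]
```

If the raw CSV from tabula were saved to a file, the same check would show which row is ragged and therefore which part of the table was split into an extra column.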

Things I have tried:

Since the error seems to point to pandas, I tried to read a single-page PDF with only one table. The same error occurs.
Set the user variable PATH to Java. This did not change anything. I can't set the system variable PATH to Java, since it is currently used for our SVN program.

Different code lines, with the same error:

df = read_pdf(r"table_p9.pdf")
df = read_pdf("table_p9.pdf", output_format='json')

I hope someone can chip in and help me figure out where the problem lies. It could be a Java issue, but I am not that familiar with the required Java interaction. Your help is much appreciated.

Edit

I tried different tables and some seem to work. It has been difficult to identify which types of tables work. Some with 'merged' columns and others with 'merged' rows seem to work, but clearly not all. Also, I have not been able to read multiple tables (2 or 3) using the argument multiple_tables=True.
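
The kind of variation I have been trying can be sketched as a small loop over extraction modes. This is only a sketch under assumptions: the helper names extraction_attempts and try_read are hypothetical, it assumes that the installed tabula-py accepts the pages, multiple_tables, lattice and stream keyword arguments of read_pdf, and it assumes that Java is reachable. With multiple_tables=True, read_pdf is expected to return a list of DataFrames rather than a single one.

```python
def extraction_attempts():
    """Keyword-argument sets to try in turn with tabula.read_pdf (assumed to be supported)."""
    return [
        {"pages": "all", "multiple_tables": True},                   # default detection
        {"pages": "all", "multiple_tables": True, "lattice": True},  # tables with ruling lines
        {"pages": "all", "multiple_tables": True, "stream": True},   # whitespace-separated tables
    ]


def try_read(path):
    """Return (kwargs, tables) for the first attempt that does not raise, else (None, [])."""
    # Imported lazily so the module loads even where tabula-py or Java is missing.
    from tabula import read_pdf

    for kwargs in extraction_attempts():
        try:
            tables = read_pdf(path, **kwargs)
        except Exception as exc:  # broad on purpose: this is a diagnostic sketch
            print(kwargs, "failed:", exc)
            continue
        return kwargs, tables
    return None, []
```

Calling try_read("table_p16.pdf") would then report which mode, if any, gets through without the tokenizing error.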

Is there any documentation on what kinds of tables Tabula can handle? This also makes me wonder whether Tabula is the right program to use. After all the reading I did, I was under the impression that Tabula would be good at this. The tables it seems to struggle with are not complex.

Is there a clear and simple source on how to maximize the use of Tabula? Or otherwise tips on how to deal with tables that Tabula struggles with?

Regards, Gabriel
