Reputation: 1901
I'm having a problem with the Camelot library.
I'm extracting data from a PDF. My code runs fine for the previous 23 pages, but on this page it fails to parse the end of the text/table correctly.
I suspect the problem is that the string is so long that it reaches the table border.
I also tried "stream" but got worse results.
PDF Source Data
PDF Output LAYOUT
My parsed output looks like this:
"ALT4945\n24 V"
"70\/140 A ALT5860\n12 V\n90 A"
The desired output is:
"ALT4945\n24 V 70\/140 A"
"ALT5860\n12 V\n90 A"
My original code, which works correctly for the previous pages, is:
tables = camelot.read_pdf("CROSSREFERENCE.pdf", pages=wPAGES, flavor="lattice")
From the Camelot documentation (https://camelot-py.readthedocs.io/en/master/api.html) I gathered the possible configuration options for the PDF parser:
""" PARAMS for lattice
line_scale (default: 15)
copy_text (default: None)
shift_text (default: ['l', 't'])
line_tol (default: 2)
joint_tol (default: 2)
threshold_blocksize (default: 15)
threshold_constant (default: -2)
iterations (default: 0)
resolution (default: 300)
"""
Then I ran into this problem and tried to solve it by experimenting with more parameters, but I didn't find a winning combination:
tables = camelot.read_pdf("CROSSREFERENCE.pdf", pages=wPAGES, flavor="lattice", split_text=True, resolution=720, line_scale=250, line_tol=3, joint_tol=3, threshold_blocksize=15)
tables = camelot.read_pdf("CROSSREFERENCE.pdf", pages=wPAGES, flavor="lattice", split_text=True, resolution=720, line_scale=250, line_tol=1, joint_tol=1, threshold_blocksize=3)
Can I get some advice on which parameters to use to avoid this?
Thanks
edit1: PDF source : https://www.siom.it/images/catalogo-motorini-alter.pdf (Page 24)
Upvotes: 3
Views: 4220
Reputation: 1290
import camelot

tables = camelot.read_pdf('./catalogo-motorini-alter.pdf', pages='24',
                          flavor='stream', columns=['300'], split_text=True)
The output of tables[0].df is the following:
0 1
0 CATALOGO SIOM ALTERNATORI BOSCH \nBOSCH \nBOSCH \nBOSCH
1 ALT4800\n12 V\n65A ALT4830\n12 V\n70 A
2 IMPIANTO : BOSCH\nCOD.OEM : 0120489186 IMPIANTO : BOSCH\nCOD.OEM : 0120488172
3 APPLICAZIONI :\n OPEL VAUXHALL APPLICAZIONI :\n OPEL VAUXHALL
4 ALT4840\n12 V\n70 A ALT4890\n12 V\n90 A
5 IMPIANTO : BOSCH\nCOD.OEM : 0120488186 IMPIANTO : BOSCH\nCOD.OEM : 0123315500
6 APPLICAZIONI :\n OPEL VAUXHALL APPLICAZIONI :\n IVECO
7 ALT4900\n12 V\n90 A ALT4940\n24 V\n70/140 A
8 IMPIANTO : BOSCH\nCOD.OEM : 0123320009 IMPIANTO : BOSCH\nCOD.OEM : 0120689535
9 APPLICAZIONI :\n AUDI SKODA VW APPLICAZIONI :\n DROGMOLLER KASSBOHRER MERCEDE...
10 ALT4945\n24 V\n70/140 A ALT5860\n12 V\n90 A
11 IMPIANTO : BOSCH\nCOD.OEM : 0120689541 IMPIANTO : BOSCH\nCOD.OEM : 0120450011
12 APPLICAZIONI :\n MAN MERCEDES BENZ APPLICAZIONI :\n CHRYSLER
13 ALT6600\n12 V\n90 A ALT6610\n24 V\n80 A
14 IMPIANTO : BOSCH\nCOD.OEM : 0124325058 IMPIANTO : BOSCH\nCOD.OEM : 0124555001
15 APPLICAZIONI :\n FIAT LANCIA APPLICAZIONI :\n MERCEDES BENZ
16 Pag .24
From the docs it seems that the stream parser fits the shared document better than lattice:
"Stream can be used to parse tables that have whitespaces between cells to simulate a table structure."
For cases where the stream parser finds incorrect column separators, you can specify them by hand in the columns argument (see the documentation for details). The split_text option then tells Camelot to split text that spans those column separators :)
Although fpbhb criticized scraping PDFs in the comments, I would be rather optimistic in your specific case. The document you shared is well structured, so I would definitely try to parse it. But fpbhb's point is still correct: the parsing is heuristic, so additional precautions are required.
I suggest using regular expressions to test what you get from camelot.
You can use the code below as a starting point:
import logging
import re

import camelot


def test_tables(tables):
    """Log a warning for every cell that does not match the expected layout."""
    # headers
    HEADER_L = re.compile(r'^CATALOGO SIOM ALTERNATORI$')
    HEADER_R = re.compile(r'^BOSCH \nBOSCH \nBOSCH \nBOSCH$')
    # main cell rows: each catalogue entry spans three consecutive rows
    CELL_ROWS = [
        re.compile(r'^ALT\d{4,6}?\n(12|14|24|28) ?V\n\d{2,3}(/\d{2,3})? ?A$'),
        re.compile(r'^IMPIANTO : .*?\nCOD.OEM : [\dA]{9,10}$'),
        re.compile(r'^APPLICAZIONI :(\n[A-Z \.-]*)?$'),
    ]
    # bottom line should be Pag.##
    PAGE = re.compile(r'^Pag.\d{1,3}$')

    for ti, table in enumerate(tables):
        rows = table.df.to_numpy()
        # test headers
        if not HEADER_L.match(rows[0, 0]):
            logging.warning('tables[{}].df.iloc[0][0]: HEADER_L != {}'.format(ti, rows[0, 0]))
        if not HEADER_R.match(rows[0, 1]):
            logging.warning('tables[{}].df.iloc[0][1]: HEADER_R != {}'.format(ti, rows[0, 1]))
        # test bottom line (both columns joined into one string)
        page_str = ''.join(rows[-1])
        if not PAGE.match(page_str):
            logging.warning('tables[{}].df.iloc[-1]: PAGE != {}'.format(ti, page_str))
        # test cells: the row position modulo 3 selects the expected pattern
        for idx, row in enumerate(rows[1:-1]):
            row_idx = idx % 3
            pattern = CELL_ROWS[row_idx]
            if not pattern.match(row[0]):
                logging.warning('tables[{}].df.iloc[{}][0]: ROW {} != {}'.format(ti, idx+1, row_idx, row[0]))
            if not pattern.match(row[1]):
                logging.warning('tables[{}].df.iloc[{}][1]: ROW {} != {}'.format(ti, idx+1, row_idx, row[1]))


# pages 1..24 as a comma-separated string: '1,2,...,24'
pages_till_24 = ','.join([str(i) for i in range(1, 25)])
tables = camelot.read_pdf('./catalogo-motorini-alter.pdf', pages=pages_till_24,
                          flavor='stream', columns=['300'], split_text=True)
test_tables(tables)
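As a quick sanity check that does not need the PDF, the first cell pattern can be run directly against a few strings. This is only a sketch: the pattern is copied from the first entry of CELL_ROWS, the "lattice" samples are the malformed cells quoted in the question (with the escaped `\/` written as a plain `/`), the "stream" samples are row 10 of the output shown earlier, and failing_cells is a hypothetical helper name.

```python
import re

# Same pattern as the first entry of CELL_ROWS (model / voltage / current).
MODEL_CELL = re.compile(r'^ALT\d{4,6}?\n(12|14|24|28) ?V\n\d{2,3}(/\d{2,3})? ?A$')

# Cells as reported in the question (lattice) and as returned by stream.
lattice_cells = ["ALT4945\n24 V", "70/140 A ALT5860\n12 V\n90 A"]
stream_cells = ["ALT4945\n24 V\n70/140 A", "ALT5860\n12 V\n90 A"]


def failing_cells(cells):
    """Return the cells that do not match the expected model-cell layout."""
    return [cell for cell in cells if not MODEL_CELL.match(cell)]


print(failing_cells(lattice_cells))  # both lattice cells are flagged
print(failing_cells(stream_cells))   # nothing is flagged
```

So the same check that passes on the stream output would have caught the broken lattice cells from the question.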
It gives only one insignificant warning (extra whitespace):
WARNING:root:tables[8].df.iloc[7][1]: ROW 0 != ALT122300
12 V
45 A
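If the warning really is caused only by stray whitespace (an assumption, since the offending characters are not visible in the log), one option is to normalize each cell before matching. A minimal sketch, where normalize_cell is a hypothetical helper and not part of Camelot, and the sample string is made up to resemble the warning:

```python
import re

# Same pattern as the first entry of CELL_ROWS.
MODEL_CELL = re.compile(r'^ALT\d{4,6}?\n(12|14|24|28) ?V\n\d{2,3}(/\d{2,3})? ?A$')


def normalize_cell(cell):
    """Strip leading/trailing whitespace from every line of a cell."""
    return '\n'.join(line.strip() for line in cell.split('\n'))


# Hypothetical cell with a trailing space after the model code.
raw = "ALT122300 \n12 V\n45 A"
print(bool(MODEL_CELL.match(raw)))                  # False
print(bool(MODEL_CELL.match(normalize_cell(raw))))  # True
```

In test_tables this could be applied as pattern.match(normalize_cell(row[0])), which keeps the patterns strict while ignoring whitespace noise.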
Well, it looks like you can be happy: it seems to work, and you have code to test the other pages. Good luck :)
Upvotes: 5