Wonka
Wonka

Reputation: 1901

Camelot Pdf Extraction FAIL parsing

Im getting a problem with Camelot library

Im extracting data from PDF, my code is running "ok" for previous 23 page, but for this case its failing to parse text/table ending

I suppose the problem is the string is so long reaching table border

Also tried "stream" but got worst results

PDF Source Data

pdf

PDF Output LAYOUT

layout

My output parsed is like

"ALT4945\n24 V"
"70\/140 A   ALT5860\n12 V\n90 A"

Desired output should be

"ALT4945\n24 V 70\/140 A"
"ALT5860\n12 V\n90 A"

My first code that work correctly for previous page is

tables = camelot.read_pdf("CROSSREFERENCE.pdf", pages=wPAGES, flavor="lattice")

From the website Camelot Doc https://camelot-py.readthedocs.io/en/master/api.html I get that posible configuration on pdf parser.

"" PARAMS for lattice
line_scale  (default: 15)
copy_text   ((default: None))
shift_text  (default: ['l', 't'])
line_tol    (default: 2)
joint_tol   (default: 2)
threshold_blocksize   (default: 15)
threshold_constant    (default: -2)
iterations   (default: 0)
resolution   (default: 300)
"""

Then I get that problem, tried to solve "playing" with more params, but didnt found the winner

tables = camelot.read_pdf("CROSSREFERENCE.pdf", pages=wPAGES, flavor="lattice", split_text=True, resolution=720, line_scale=250, line_tol=3, joint_tol=3, threshold_blocksize=15)

tables = camelot.read_pdf("CROSSREFERENCE.pdf", pages=wPAGES, flavor="lattice", split_text=True, resolution=720, line_scale=250, line_tol=1, joint_tol=1, threshold_blocksize=3)

Can I get some advice about params to avoid that??

Thanks

edit1: PDF source : https://www.siom.it/images/catalogo-motorini-alter.pdf (Page 24)

Upvotes: 3

Views: 4220

Answers (1)

MrPisarik
MrPisarik

Reputation: 1290

Tested Solution

tables = camelot.read_pdf('./catalogo-motorini-alter.pdf', pages='24', 
                           flavor='stream', columns=['300'], split_text=True)

The output of tables[0].df is following:

                                         0                                                  1
0                CATALOGO SIOM ALTERNATORI                   BOSCH  \nBOSCH  \nBOSCH  \nBOSCH
1                       ALT4800\n12 V\n65A                                ALT4830\n12 V\n70 A
2   IMPIANTO : BOSCH\nCOD.OEM : 0120489186             IMPIANTO : BOSCH\nCOD.OEM : 0120488172
3           APPLICAZIONI :\n OPEL VAUXHALL                     APPLICAZIONI :\n OPEL VAUXHALL
4                      ALT4840\n12 V\n70 A                                ALT4890\n12 V\n90 A
5   IMPIANTO : BOSCH\nCOD.OEM : 0120488186             IMPIANTO : BOSCH\nCOD.OEM : 0123315500
6           APPLICAZIONI :\n OPEL VAUXHALL                             APPLICAZIONI :\n IVECO
7                      ALT4900\n12 V\n90 A                            ALT4940\n24 V\n70/140 A
8   IMPIANTO : BOSCH\nCOD.OEM : 0123320009             IMPIANTO : BOSCH\nCOD.OEM : 0120689535
9           APPLICAZIONI :\n AUDI SKODA VW  APPLICAZIONI :\n DROGMOLLER KASSBOHRER MERCEDE...
10                 ALT4945\n24 V\n70/140 A                                ALT5860\n12 V\n90 A
11  IMPIANTO : BOSCH\nCOD.OEM : 0120689541             IMPIANTO : BOSCH\nCOD.OEM : 0120450011
12      APPLICAZIONI :\n MAN MERCEDES BENZ                          APPLICAZIONI :\n CHRYSLER
13                     ALT6600\n12 V\n90 A                                ALT6610\n24 V\n80 A
14  IMPIANTO : BOSCH\nCOD.OEM : 0124325058             IMPIANTO : BOSCH\nCOD.OEM : 0124555001
15            APPLICAZIONI :\n FIAT LANCIA                     APPLICAZIONI :\n MERCEDES BENZ
16                                     Pag                                                .24

Explanation

From the docs it seems that stream parser fits better than lattice for the shared document:

Stream can be used to parse tables that have whitespaces between cells to simulate a table structure.

And for the cases when a stream parser finds incorrect columns separators you can specify them by hand in columns argument (details). Then split_text option says to split text with those columns:)

Discussions

Although fpbhb criticized scraping PDFs in comments, I would be rather optimistic in your specific case. The document you shared is well structured. So I would definitely try to parse it. But the point of fpbhb still correct that it is heuristic. So additional precautions are required.

I suggest you to use regular expressions to test what you got from camelot.

You can use the code below as a starting point:

import re
import logging

def test_tables(tables):
    # headers
    HEADER_L = re.compile('^CATALOGO SIOM ALTERNATORI$')
    HEADER_R = re.compile('^BOSCH  \nBOSCH  \nBOSCH  \nBOSCH$')

    # main cell rows
    CELL_ROWS = [
        re.compile('^ALT\d{4,6}?\n(12|14|24|28) ?V\n\d{2,3}(/\d{2,3})? ?A$'),
        re.compile('^IMPIANTO : .*?\nCOD.OEM : [\dA]{9,10}$'),
        re.compile('^APPLICAZIONI :(\n[A-Z \.-]*)?$')
    ]

    # bottom line should be Pag.##
    PAGE = re.compile('^Pag.\d{1,3}$')

    for ti, table in enumerate(tables):
        rows = table.df.to_numpy()
        # test headers
        if not HEADER_L.match(rows[0, 0]):
            logging.warning('tables[{}].df.iloc[0][0]: HEADER_L != {}'.format(ti, rows[0, 0]))
        if not HEADER_R.match(rows[0, 1]):
            logging.warning('tables[{}].df.iloc[0][1]: HEADER_R != {}'.format(ti, rows[0, 1]))

        # test bottom line
        page_str = ''.join(rows[-1])
        if not PAGE.match(page_str):
            logging.warning('tables[{}].df.iloc[-1]: PAGE != {}'.format(ti, page_str))

        # test cells
        for idx, row in enumerate(rows[1:-1]):
            row_idx = idx % 3
            pattern = CELL_ROWS[row_idx]
            if not pattern.match(row[0]):
                logging.warning('tables[{}].df.iloc[{}][0]: ROW {} != {}'.format(ti, idx+1, row_idx, row[0]))
            if not pattern.match(row[1]):
                logging.warning('tables[{}].df.iloc[{}][1]: ROW {} != {}'.format(ti, idx+1, row_idx, row[1]))

Test first 24 pages

pages_till_24 = ','.join([str(i) for i in range(1,25)])
tables = camelot.read_pdf('./catalogo-motorini-alter.pdf', pages=pages_till_24, 
                          flavor='stream', columns=['300'], split_text=True) 
test_tables(tables)

It gives only one insignificant warning (extra whitespace)

WARNING:root:tables[8].df.iloc[7][1]: ROW 0 != ALT122300
12 V 
45 A

Conclusion

Well, It looks like you can be happy, because it seems to work and you have code to test other pages. Good Luck:)

Upvotes: 5

Related Questions