PMSK

Reputation: 163

PDFplumber omits rightmost column in table

Does anyone have any clues about the missing column? I have been using pdfplumber to extract table data with good results, apart from one particular set of PDFs. The problem is that while page.search finds text in the rightmost column of the table, extract_table omits that column. This is on Windows 11. Here is an image of the PDF: [Image of the PDF]. Link to the PDF file on Dropbox: https://www.dropbox.com/scl/fi/d3cg802h7cawl6vw9i7cm/testdoc.pdf?rlkey=tmz390ly5fbug0xi0kx06b2kt&dl=0

Here is the page image with vertical lines superimposed: [Using PDFplumber's image debugging to show where the vertical lines are]. Here is the minimal code:

# pdftesting.py
import pdfplumber
import sys

print('pdfplumber version:', pdfplumber.__version__)
print('Python version:', sys.version)
filepath  = 'C:/ProgramData/PythonProgs/testing/testdoc.pdf'
fn = pdfplumber.open(filepath)
page = fn.pages[0]

vlines = [26.0, 106.25, 152.25, 251.25, 395.5, 467.25, 539.5, 624.65, 692.5,
          760.15, 818.9811199999999]
imagefile = 'C:/ProgramData/PythonProgs/testing/testdoc.png'
im = page.to_image(resolution=300)
im.draw_vlines(vlines, stroke_width=3)
im.save(imagefile)
lines = page.extract_table(table_settings={
    "vertical_strategy": "explicit",
    "explicit_vertical_lines": vlines,
    "horizontal_strategy": "text",
    "snap_tolerance": 5,
})
for item in lines:
    print('line:', item)
    
print('page width:', page.width)
target = 'inc'
X0 = page.search(target)[0]['x0']
X1 = page.search(target)[0]['x1']
size = page.search(target)[0]['chars'][0]['size']
print('Found:', target, X0, X1, size)

Here is the output from the code:

pdfplumber version: 0.11.0
Python: 3.12.0 (tags/v3.12.0:0fb18b0, Oct  2 2023, 13:03:39) [MSC v.1935 64 bit (AMD64)]
line: ['', '', '', 'tne minute, rou', 'naea up to tn', 'e nearest mi', 'nute', '', '']
line: ['UK calls', '', '', '', '', '', '', '', '']
line: ['', '', '', '', '', '', '', '', '']
line: ['Date', 'Time', 'Phone number', 'Destination', 'Duration', 'Charged', 'Included?', 'VAT', 'VAT']
line: ['', '', '', '', 'hh:mm:ss', 'hh:mm:ss', '', 'ex', 'rate']
line: ['', '', '', '', '', '', '', '', '']
line: ['Sun 17 May', '15:55', '07755221961', 'UK mobile', '00:05:26', '00:05:26', 'Yes', '£0.000', '20%']
line: ['', '', '', '', '', '', '', '', '']
line: ['Thu 21 May', '11:15', '07818818242', 'Vodafone mobile', '00:00:07', '00:01:00', 'Yes', '£0.000', '20%']
line: ['', '', '', '', '', '', '', '', '']
line: ['Fri 22 May', '15:44', '05706000459', 'Landline', '00:00:04', '00:01:00', 'Yes', '£0.000', '20%']
line: ['', '', '', '', '', '', '', '', '']
line: ['Mon 25 May', '20:48', '02085462206', 'Landline', '00:15:12', '00:15:12', 'Yes', '£0.000', '20%']
line: ['', '', '', '', '', '', '', '', '']
line: ['Sat 50 May', '10:58', '02056549856', 'Landline', '00:00:08', '00:01:00', 'Yes', '£0.000', '20%']
line: ['', '', '', '', '', '', '', '', '']
line: ['Fri 5 Jun', '09:58', '07818818242', 'Vodafone mobile', '00:00:11', '00:01:00', 'Yes', '£0.000', '20%']
line: ['', '', '', '', '', '', '', '', '']
line: ['Sat 6 Jun', '07:17', '07716065665', 'Vodafone mobile', '00:01:14', '00:01:14', 'Yes', '£0.000', '20%']
line: ['', '', '', '', '', '', '', '', '']
line: ['', '', '', 'Tot', 'al of 7 calls', '23 mins 52 s', '', '£0.000', '']
page width: 856.800048828
Found: inc 761.15 773.9811199999999 9.961000000000013

Upvotes: 4

Views: 227

Answers (4)

stenag

Reputation: 114

Please try changing the 11th entry in vlines to 794.

So, change vlines from:

vlines = [26.0, 106.25, 152.25, 251.25, 395.5, 467.25, 539.5, 624.65, 692.5, \
          760.15, 818.9811199999999]

to:

vlines = [26.0, 106.25, 152.25, 251.25, 395.5, 467.25, 539.5, 624.65, 692.5, \
          760.15, 794]

When I did this and ran the code, lines contained:

lines =  [['', '', '', 'tne minute, rou', 'naea up to tn', 'e nearest mi', 'nute', '', '', ''], ['UK calls', '', '', '', '', '', '', '', '', ''], ['', '', '', '', '', '', '', '', '', ''], ['Date', 'Time', 'Phone number', 'Destination', 'Duration', 'Charged', 'Included?', 'VAT', 'VAT', 'VAT'], ['', '', '', '', 'hh:mm:ss', 'hh:mm:ss', '', 'ex', 'rate', 'inc'], ['', '', '', '', '', '', '', '', '', ''], ['Sun 17 May', '15:55', '07755221961', 'UK mobile', '00:05:26', '00:05:26', 'Yes', '£0.000', '20%', '£0.000'], ['', '', '', '', '', '', '', '', '', ''], ['Thu 21 May', '11:15', '07818818242', 'Vodafone mobile', '00:00:07', '00:01:00', 'Yes', '£0.000', '20%', '£0.000'], ['', '', '', '', '', '', '', '', '', ''], ['Fri 22 May', '15:44', '05706000459', 'Landline', '00:00:04', '00:01:00', 'Yes', '£0.000', '20%', '£0.000'], ['', '', '', '', '', '', '', '', '', ''], ['Mon 25 May', '20:48', '02085462206', 'Landline', '00:15:12', '00:15:12', 'Yes', '£0.000', '20%', '£0.000'], ['', '', '', '', '', '', '', '', '', ''], ['Sat 50 May', '10:58', '02056549856', 'Landline', '00:00:08', '00:01:00', 'Yes', '£0.000', '20%', '£0.000'], ['', '', '', '', '', '', '', '', '', ''], ['Fri 5 Jun', '09:58', '07818818242', 'Vodafone mobile', '00:00:11', '00:01:00', 'Yes', '£0.000', '20%', '£0.000'], ['', '', '', '', '', '', '', '', '', ''], ['Sat 6 Jun', '07:17', '07716065665', 'Vodafone mobile', '00:01:14', '00:01:14', 'Yes', '£0.000', '20%', '£0.000'], ['', '', '', '', '', '', '', '', '', ''], ['', '', '', 'Tot', 'al of 7 calls', '23 mins 52 s', '', '£0.000', '', '£0.000']]

For example, here is the third-last entry in lines:

['Sat 6 Jun', '07:17', '07716065665', 'Vodafone mobile', '00:01:14', '00:01:14', 'Yes', '£0.000', '20%', '£0.000']

You can see that each list in lines has 10 entries.

However, note that my Python and pdfplumber versions differ from yours:

pdfplumber version: 0.11.4
Python version: 3.13.0 

P.S. I deleted this answer when I saw where the last line at 794 falls on the PNG: it leaves room for 0.000 but not for larger values. 794 was the widest position that worked for me. This does not explain why a wider position fails, and I don't know the answer to that. I am restoring the answer following @K J's comment, thanks.

Upvotes: 2

Vitalizzare

Reputation: 7250

Resolving table extraction issues with combined "text" and "explicit" strategies

Here's what happens when you choose the horizontal strategy "text":

  1. Words on the page are clustered based on their "top" parameter.
  2. These clusters are filtered by a minimum word count.
  3. Two horizontal edges are constructed at the top and bottom of each remaining cluster, using a fixed leftmost "x0" and rightmost "x1" for all.
  4. Cells are created at the intersections of these horizontal edges with the given vertical lines.
  5. Tables are formed by combining adjacent cells.
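Steps 1 to 3 can be sketched in plain Python. This is a simplified illustration rather than pdfplumber's actual implementation: the function name `words_to_horizontal_edges`, the `tolerance` parameter, and the sample word coordinates are all assumptions made up for the example. It shows why every horizontal edge ends at the same `x1`, namely the right side of the rightmost word.

```python
def words_to_horizontal_edges(words, min_words=1, tolerance=1.0):
    """Simplified sketch of the "text" horizontal strategy (not pdfplumber's code)."""
    # Step 1: cluster words whose "top" is within `tolerance` of the previous word.
    clusters = []
    for word in sorted(words, key=lambda w: w["top"]):
        if clusters and word["top"] - clusters[-1][-1]["top"] <= tolerance:
            clusters[-1].append(word)
        else:
            clusters.append([word])

    # Step 2: keep only clusters with enough words.
    clusters = [c for c in clusters if len(c) >= min_words]
    if not clusters:
        return []

    # Step 3: one shared x0/x1 for all edges, taken from the word extents.
    x0 = min(w["x0"] for c in clusters for w in c)
    x1 = max(w["x1"] for c in clusters for w in c)
    edges = []
    for cluster in clusters:
        top = min(w["top"] for w in cluster)
        bottom = max(w["bottom"] for w in cluster)
        for y in (top, bottom):
            edges.append(
                {"x0": x0, "x1": x1, "top": y, "bottom": y, "orientation": "h"}
            )
    return edges


# Hypothetical words: two on a header row, one on a data row.
sample_words = [
    {"text": "Date", "x0": 26.25, "x1": 50.0, "top": 37.5, "bottom": 47.0},
    {"text": "inc", "x0": 761.15, "x1": 773.98, "top": 37.6, "bottom": 47.1},
    {"text": "£0.000", "x0": 761.15, "x1": 791.23, "top": 61.4, "bottom": 70.9},
]
edges = words_to_horizontal_edges(sample_words)
rightmost_reach = max(edge["x1"] for edge in edges)
```

With these sample words, `rightmost_reach` is the `x1` of the widest word, which is well short of an explicit vertical line placed at 818.98.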

The issue you experienced occurred at step 4 because the horizontal edges built on words are shorter than the maximum width between the vertical lines. Let's visualize this:

import pdfplumber

filepath  = '/home/jakito/Desktop/testdoc.pdf'
pdf = pdfplumber.open(filepath)
page = pdf.pages[0]

v_lines = [26.0, 106.25, 152.25, 251.25, 395.5, 467.25, 539.5, 624.65, 692.5, 760.15, 818.9811199999999]
table_settings={
    "vertical_strategy": "explicit",
    "explicit_vertical_lines": v_lines,
    "horizontal_strategy": "text",
    "snap_tolerance": 5}
page.to_image(resolution=400).debug_tablefinder(table_settings).show()

[Image: horizontals do not intersect with the vertical on the right]

Note that the horizontal edges do not touch the rightmost vertical line. As a result, there are not enough vertices to construct cells in the last column.

Here are several ways to resolve the issue:

  1. Adjust the position of the last vertical line so that it touches the words on the right:
v_lines[-1] = max(char['x1'] for char in page.chars)

[Image: replaced position of the right vertical line]

  2. Apply the "text" strategy to both directions with adjusted word limits:
table_settings={
        "vertical_strategy":"text",
        "horizontal_strategy": "text",
        "min_words_vertical": 3,
        "min_words_horizontal": 11
}

page.to_image(resolution=300).debug_tablefinder(table_settings).show()

[Image: text strategy]

  3. Use the "explicit" horizontal strategy:
from pdfplumber.table import words_to_edges_h

words = page.extract_words()
h_edges = words_to_edges_h(words, word_threshold=6)
h_lines = [x['top'] for x in h_edges[::2] if 0 <= x['top'] <= page.height]

v_lines = [26.0, 106.25, 152.25, 251.25, 395.5, 467.25, 539.5, 624.65, 692.5, 760.15, 818.9811199999999]

table_settings={
        "vertical_strategy":"explicit",
        "explicit_vertical_lines":v_lines,
        "horizontal_strategy": "explicit",
        "explicit_horizontal_lines": h_lines
}

page.to_image(resolution=300).debug_tablefinder(table_settings).show()

[Image: explicit horizontal strategy]

Note: Currently, words_to_edges_h returns two edges for each cluster, which is excessive. To address this, I filtered them using h_edges[::2]. The lowest line can be added manually if needed, but in this case, it can be omitted. Additionally, I applied filtering based on 0 and the page height due to the specifics of the sample document, which appears to be a cropped version of a larger one. word_threshold=6 was added to avoid splitting "hh:mm:ss ..." into a separate line.
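The filtering described in the note can be tried in isolation. The sketch below is only an illustration: the helper name `unique_line_tops`, the sample edge values, and the page height of 211.35 are assumptions, and it relies on the edges arriving in pairs (two per cluster) as the note describes.

```python
def unique_line_tops(h_edges, page_height):
    """Keep one "top" per pair of edges, then drop values outside the page."""
    # Every other edge: one representative per cluster of words.
    tops = [edge["top"] for edge in h_edges[::2]]
    # Discard positions above or below the visible page area.
    return [top for top in tops if 0 <= top <= page_height]


# Hypothetical edges, two per word cluster; the first pair starts above the page.
sample_h_edges = [
    {"top": -2.6}, {"top": 7.8},
    {"top": 25.1}, {"top": 37.6},
    {"top": 50.1}, {"top": 61.4},
]
h_lines = unique_line_tops(sample_h_edges, page_height=211.35)
```

Here the first pair is dropped because its representative lies above the page, leaving one line position for each of the remaining two clusters.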

Upvotes: 2

stenag

Reputation: 114

def edges_to_intersections(
    edges: T_obj_list, x_tolerance: T_num = 1, y_tolerance: T_num = 1
) -> T_intersections:
    """
    Given a list of edges, return the points at which they intersect
    within `tolerance` pixels.
    """
    intersections: T_intersections = {}
    
    # Added lines start here.
    # wfirstedgesx1 is x1 of the first entry in edges; wlastedgesx1 is x1 of the last.
    # If wlastedgesx1 > wfirstedgesx1 (vertical lines with horizontal text),
    #   set x1 = wlastedgesx1 on every edge whose x1 equals wfirstedgesx1.
    # Only do this if all horizontal edges share the same x1.
    wdosomething='YES'
    wedgeslen=len(edges)
    wfirstedges=edges[0]
    wfirstedgesx1=wfirstedges['x1']
    
    wedgesn2=0
    while wedgesn2<wedgeslen:
        wthisedges=edges[wedgesn2]
        if wthisedges['orientation'] == 'h':
            if wthisedges['x1'] != wfirstedgesx1:
                wdosomething=''+'NO '
                break
        else:
            break
        wedgesn2+=1

    if wdosomething=='YES':
        wlastedges=edges[wedgeslen-1]
        wlastedgesx1=wlastedges['x1']
        if wlastedgesx1 > wfirstedgesx1:
            wedgesn=0
            while wedgesn<wedgeslen:
                wthisedges=edges[wedgesn]
                if wthisedges['x1'] == wfirstedgesx1:
                    wthisedges['x1'] = wlastedgesx1
                    edges[wedgesn]=wthisedges
                wedgesn+=1

I added the lines shown above to table.py, immediately after this line:

intersections: T_intersections = {}

With this change, your example extracts all 10 columns from the table.

= = =

When a table uses explicit vertical lines and text-based horizontal edges, every horizontal entry in edges (the horizontal entries come first) has x1 = 791.228059825, so pdfplumber has calculated the right-hand edge of the table as 791.228059825 for those entries. The vertical entries come second and last in edges, and the last of them has x1 = 818.9811199999999, which is your last vertical line (in vlines in your code).

The change checks whether x1 of the first entry in edges is less than x1 of the last entry. If it is, every entry whose x1 equals that of the first entry has its x1 set to that of the last entry.
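The same idea can be written as a standalone function that does not require editing table.py. This is a variation on the change described here, not the patch itself: the name `extend_horizontal_edges` is hypothetical, it selects edges by their `orientation` key instead of relying on list order, it returns copies rather than mutating in place, and it also refreshes `width`.

```python
def extend_horizontal_edges(edges):
    """Stretch text-derived horizontal edges to the rightmost vertical edge.

    Returns new dicts; the input list and its dicts are left untouched.
    """
    copies = [dict(edge) for edge in edges]
    h_edges = [e for e in copies if e["orientation"] == "h"]
    v_edges = [e for e in copies if e["orientation"] == "v"]
    if not h_edges or not v_edges:
        return copies

    # Only act when every horizontal edge ends at the same x1.
    if len({e["x1"] for e in h_edges}) != 1:
        return copies

    rightmost_v = max(e["x0"] for e in v_edges)
    if rightmost_v > h_edges[0]["x1"]:
        for e in h_edges:
            e["x1"] = rightmost_v
            e["width"] = e["x1"] - e["x0"]
    return copies


# Two horizontal and two vertical edges with the x values from this example.
sample_edges = [
    {"x0": 26.25, "x1": 791.228059825, "top": 7.76, "bottom": 7.76,
     "width": 764.978059825, "orientation": "h"},
    {"x0": 26.25, "x1": 791.228059825, "top": 25.12, "bottom": 25.12,
     "width": 764.978059825, "orientation": "h"},
    {"x0": 26.0, "x1": 26.0, "top": -0.021, "bottom": 211.329,
     "height": 211.35, "orientation": "v"},
    {"x0": 818.9811199999999, "x1": 818.9811199999999, "top": -0.021,
     "bottom": 211.329, "height": 211.35, "orientation": "v"},
]
extended = extend_horizontal_edges(sample_edges)
```

After the call, the horizontal entries in `extended` end at 818.9811199999999, while `sample_edges` still holds the original 791.228059825.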

In your example, edges before the new code is:

[{'x0': 26.25, 'x1': 791.228059825, 'top': -2.5922303459999796, 'bottom': -2.5922303459999796, 'width': 764.978059825, 'orientation': 'h'}, {'x0': 26.25, 'x1': 791.228059825, 'top': 7.759869554000005, 'bottom': 7.759869554000005, 'width': 764.978059825, 'orientation': 'h'}, {'x0': 26.25, 'x1': 791.228059825, 'top': 25.120349454000007, 'bottom': 25.120349454000007, 'width': 764.978059825, 'orientation': 'h'}, {'x0': 26.25, 'x1': 791.228059825, 'top': 37.558197654, 'bottom': 37.558197654, 'width': 764.978059825, 'orientation': 'h'}, {'x0': 26.25, 'x1': 791.228059825, 'top': 50.08257475400001, 'bottom': 50.08257475400001, 'width': 764.978059825, 'orientation': 'h'}, {'x0': 26.25, 'x1': 791.228059825, 'top': 61.358104454000014, 'bottom': 61.358104454000014, 'width': 764.978059825, 'orientation': 'h'}, {'x0': 26.25, 'x1': 791.228059825, 'top': 69.11927415400001, 'bottom': 69.11927415400001, 'width': 764.978059825, 'orientation': 'h'}, {'x0': 26.25, 'x1': 791.228059825, 'top': 78.62654415400002, 'bottom': 78.62654415400002, 'width': 764.978059825, 'orientation': 'h'}, {'x0': 26.25, 'x1': 791.228059825, 'top': 84.77466511400002, 'bottom': 84.77466511400002, 'width': 764.978059825, 'orientation': 'h'}, {'x0': 26.25, 'x1': 791.228059825, 'top': 94.52591730400002, 'bottom': 94.52591730400002, 'width': 764.978059825, 'orientation': 'h'}, {'x0': 26.25, 'x1': 791.228059825, 'top': 100.721778454, 'bottom': 100.721778454, 'width': 764.978059825, 'orientation': 'h'}, {'x0': 26.25, 'x1': 791.228059825, 'top': 110.47587845400001, 'bottom': 110.47587845400001, 'width': 764.978059825, 'orientation': 'h'}, {'x0': 26.25, 'x1': 791.228059825, 'top': 116.921810054, 'bottom': 116.921810054, 'width': 764.978059825, 'orientation': 'h'}, {'x0': 26.25, 'x1': 791.228059825, 'top': 126.42587005400001, 'bottom': 126.42587005400001, 'width': 764.978059825, 'orientation': 'h'}, {'x0': 26.25, 'x1': 791.228059825, 'top': 132.869274154, 'bottom': 132.869274154, 'width': 764.978059825, 'orientation': 
'h'}, {'x0': 26.25, 'x1': 791.228059825, 'top': 142.37654415400002, 'bottom': 142.37654415400002, 'width': 764.978059825, 'orientation': 'h'}, {'x0': 26.25, 'x1': 791.228059825, 'top': 148.578019454, 'bottom': 148.578019454, 'width': 764.978059825, 'orientation': 'h'}, {'x0': 26.25, 'x1': 791.228059825, 'top': 158.32591730400003, 'bottom': 158.32591730400003, 'width': 764.978059825, 'orientation': 'h'}, {'x0': 26.25, 'x1': 791.228059825, 'top': 164.51956645400003, 'bottom': 164.51956645400003, 'width': 764.978059825, 'orientation': 'h'}, {'x0': 26.25, 'x1': 791.228059825, 'top': 174.27591730400002, 'bottom': 174.27591730400002, 'width': 764.978059825, 'orientation': 'h'}, {'x0': 26.25, 'x1': 791.228059825, 'top': 179.994190954, 'bottom': 179.994190954, 'width': 764.978059825, 'orientation': 'h'}, {'x0': 26.25, 'x1': 791.228059825, 'top': 190.03954095400002, 'bottom': 190.03954095400002, 'width': 764.978059825, 'orientation': 'h'}, {'x0': 26.0, 'x1': 26.0, 'top': -0.021000001999993856, 'bottom': 211.32901745200002, 'height': 211.350017454, 'orientation': 'v'}, {'x0': 106.25, 'x1': 106.25, 'top': -0.021000001999993856, 'bottom': 211.32901745200002, 'height': 211.350017454, 'orientation': 'v'}, {'x0': 152.25, 'x1': 152.25, 'top': -0.021000001999993856, 'bottom': 211.32901745200002, 'height': 211.350017454, 'orientation': 'v'}, {'x0': 251.25, 'x1': 251.25, 'top': -0.021000001999993856, 'bottom': 211.32901745200002, 'height': 211.350017454, 'orientation': 'v'}, {'x0': 395.5, 'x1': 395.5, 'top': -0.021000001999993856, 'bottom': 211.32901745200002, 'height': 211.350017454, 'orientation': 'v'}, {'x0': 467.25, 'x1': 467.25, 'top': -0.021000001999993856, 'bottom': 211.32901745200002, 'height': 211.350017454, 'orientation': 'v'}, {'x0': 539.5, 'x1': 539.5, 'top': -0.021000001999993856, 'bottom': 211.32901745200002, 'height': 211.350017454, 'orientation': 'v'}, {'x0': 624.65, 'x1': 624.65, 'top': -0.021000001999993856, 'bottom': 211.32901745200002, 'height': 211.350017454, 
'orientation': 'v'}, {'x0': 692.5, 'x1': 692.5, 'top': -0.021000001999993856, 'bottom': 211.32901745200002, 'height': 211.350017454, 'orientation': 'v'}, {'x0': 760.15, 'x1': 760.15, 'top': -0.021000001999993856, 'bottom': 211.32901745200002, 'height': 211.350017454, 'orientation': 'v'}, {'x0': 818.9811199999999, 'x1': 818.9811199999999, 'top': -0.021000001999993856, 'bottom': 211.32901745200002, 'height': 211.350017454, 'orientation': 'v'}]

An excerpt showing the first two entries and the last entry:

[

{'x0': 26.25, 'x1': 791.228059825, 'top': -2.5922303459999796, 'bottom': -2.5922303459999796, 'width': 764.978059825, 'orientation': 'h'}, 

{'x0': 26.25, 'x1': 791.228059825, 'top': 7.759869554000005, 'bottom': 7.759869554000005, 'width': 764.978059825, 'orientation': 'h'}, 
:
{'x0': 818.9811199999999, 'x1': 818.9811199999999, 'top': -0.021000001999993856, 'bottom': 211.32901745200002, 'height': 211.350017454, 'orientation': 'v'}

]

The entries are the horizontal entries followed by the vertical entries. The entries with x1 = 791.228059825 are the horizontal entries. They are followed by the vertical entries, where x1 is not 791.228059825. The last of these, and indeed the last entry in edges, has x1 = 818.9811199999999.

The new lines of code assume that when the last x1 is greater than the first x1, the table uses vertical lines and horizontal text. In that case they change every x1 that equals the first x1 to equal the last x1. This changes edges from the above to:

[{'x0': 26.25, 'x1': 818.9811199999999, 'top': -2.5922303459999796, 'bottom': -2.5922303459999796, 'width': 764.978059825, 'orientation': 'h'}, {'x0': 26.25, 'x1': 818.9811199999999, 'top': 7.759869554000005, 'bottom': 7.759869554000005, 'width': 764.978059825, 'orientation': 'h'}, {'x0': 26.25, 'x1': 818.9811199999999, 'top': 25.120349454000007, 'bottom': 25.120349454000007, 'width': 764.978059825, 'orientation': 'h'}, {'x0': 26.25, 'x1': 818.9811199999999, 'top': 37.558197654, 'bottom': 37.558197654, 'width': 764.978059825, 'orientation': 'h'}, {'x0': 26.25, 'x1': 818.9811199999999, 'top': 50.08257475400001, 'bottom': 50.08257475400001, 'width': 764.978059825, 'orientation': 'h'}, {'x0': 26.25, 'x1': 818.9811199999999, 'top': 61.358104454000014, 'bottom': 61.358104454000014, 'width': 764.978059825, 'orientation': 'h'}, {'x0': 26.25, 'x1': 818.9811199999999, 'top': 69.11927415400001, 'bottom': 69.11927415400001, 'width': 764.978059825, 'orientation': 'h'}, {'x0': 26.25, 'x1': 818.9811199999999, 'top': 78.62654415400002, 'bottom': 78.62654415400002, 'width': 764.978059825, 'orientation': 'h'}, {'x0': 26.25, 'x1': 818.9811199999999, 'top': 84.77466511400002, 'bottom': 84.77466511400002, 'width': 764.978059825, 'orientation': 'h'}, {'x0': 26.25, 'x1': 818.9811199999999, 'top': 94.52591730400002, 'bottom': 94.52591730400002, 'width': 764.978059825, 'orientation': 'h'}, {'x0': 26.25, 'x1': 818.9811199999999, 'top': 100.721778454, 'bottom': 100.721778454, 'width': 764.978059825, 'orientation': 'h'}, {'x0': 26.25, 'x1': 818.9811199999999, 'top': 110.47587845400001, 'bottom': 110.47587845400001, 'width': 764.978059825, 'orientation': 'h'}, {'x0': 26.25, 'x1': 818.9811199999999, 'top': 116.921810054, 'bottom': 116.921810054, 'width': 764.978059825, 'orientation': 'h'}, {'x0': 26.25, 'x1': 818.9811199999999, 'top': 126.42587005400001, 'bottom': 126.42587005400001, 'width': 764.978059825, 'orientation': 'h'}, {'x0': 26.25, 'x1': 818.9811199999999, 'top': 132.869274154, 
'bottom': 132.869274154, 'width': 764.978059825, 'orientation': 'h'}, {'x0': 26.25, 'x1': 818.9811199999999, 'top': 142.37654415400002, 'bottom': 142.37654415400002, 'width': 764.978059825, 'orientation': 'h'}, {'x0': 26.25, 'x1': 818.9811199999999, 'top': 148.578019454, 'bottom': 148.578019454, 'width': 764.978059825, 'orientation': 'h'}, {'x0': 26.25, 'x1': 818.9811199999999, 'top': 158.32591730400003, 'bottom': 158.32591730400003, 'width': 764.978059825, 'orientation': 'h'}, {'x0': 26.25, 'x1': 818.9811199999999, 'top': 164.51956645400003, 'bottom': 164.51956645400003, 'width': 764.978059825, 'orientation': 'h'}, {'x0': 26.25, 'x1': 818.9811199999999, 'top': 174.27591730400002, 'bottom': 174.27591730400002, 'width': 764.978059825, 'orientation': 'h'}, {'x0': 26.25, 'x1': 818.9811199999999, 'top': 179.994190954, 'bottom': 179.994190954, 'width': 764.978059825, 'orientation': 'h'}, {'x0': 26.25, 'x1': 818.9811199999999, 'top': 190.03954095400002, 'bottom': 190.03954095400002, 'width': 764.978059825, 'orientation': 'h'}, {'x0': 26.0, 'x1': 26.0, 'top': -0.021000001999993856, 'bottom': 211.32901745200002, 'height': 211.350017454, 'orientation': 'v'}, {'x0': 106.25, 'x1': 106.25, 'top': -0.021000001999993856, 'bottom': 211.32901745200002, 'height': 211.350017454, 'orientation': 'v'}, {'x0': 152.25, 'x1': 152.25, 'top': -0.021000001999993856, 'bottom': 211.32901745200002, 'height': 211.350017454, 'orientation': 'v'}, {'x0': 251.25, 'x1': 251.25, 'top': -0.021000001999993856, 'bottom': 211.32901745200002, 'height': 211.350017454, 'orientation': 'v'}, {'x0': 395.5, 'x1': 395.5, 'top': -0.021000001999993856, 'bottom': 211.32901745200002, 'height': 211.350017454, 'orientation': 'v'}, {'x0': 467.25, 'x1': 467.25, 'top': -0.021000001999993856, 'bottom': 211.32901745200002, 'height': 211.350017454, 'orientation': 'v'}, {'x0': 539.5, 'x1': 539.5, 'top': -0.021000001999993856, 'bottom': 211.32901745200002, 'height': 211.350017454, 'orientation': 'v'}, {'x0': 624.65, 'x1': 
624.65, 'top': -0.021000001999993856, 'bottom': 211.32901745200002, 'height': 211.350017454, 'orientation': 'v'}, {'x0': 692.5, 'x1': 692.5, 'top': -0.021000001999993856, 'bottom': 211.32901745200002, 'height': 211.350017454, 'orientation': 'v'}, {'x0': 760.15, 'x1': 760.15, 'top': -0.021000001999993856, 'bottom': 211.32901745200002, 'height': 211.350017454, 'orientation': 'v'}, {'x0': 818.9811199999999, 'x1': 818.9811199999999, 'top': -0.021000001999993856, 'bottom': 211.32901745200002, 'height': 211.350017454, 'orientation': 'v'}]

An excerpt showing the first two entries and the last entry:

[

{'x0': 26.25, 'x1': 818.9811199999999, 'top': -2.5922303459999796, 'bottom': -2.5922303459999796, 'width': 764.978059825, 'orientation': 'h'}, 

{'x0': 26.25, 'x1': 818.9811199999999, 'top': 7.759869554000005, 'bottom': 7.759869554000005, 'width': 764.978059825, 'orientation': 'h'}, 

:

{'x0': 818.9811199999999, 'x1': 818.9811199999999, 'top': -0.021000001999993856, 'bottom': 211.32901745200002, 'height': 211.350017454, 'orientation': 'v'}

]

I have tested it with your example, and also with an Excel spreadsheet containing a few different-sized tables, extracting them with horizontal and vertical lines. Following this test, I changed the code so that it makes no updates unless all the horizontal entries in edges have the same x1.

= = =

The place where I found the issue, which shows why I added the code above, is:

v_edges, h_edges = [
    list(filter(lambda x: x["orientation"] == o, edges)) for o in ("v", "h")
]
for v in sorted(v_edges, key=itemgetter("x0", "top")):
    for h in sorted(h_edges, key=itemgetter("top", "x0")):
        if (
            (v["top"] <= (h["top"] + y_tolerance))
            and (v["bottom"] >= (h["top"] - y_tolerance))
            and (v["x0"] >= (h["x0"] - x_tolerance))
            and (v["x0"] <= (h["x1"] + x_tolerance))       #wdebug
        ):
            vertex = (v["x0"], h["top"])
            if vertex not in intersections:
                intersections[vertex] = {"v": [], "h": []}
            intersections[vertex]["v"].append(v)
            intersections[vertex]["h"].append(h)
        else: 
            wdebug = 1
return intersections

This section of code comes immediately after the code I added. I also added its third-last and second-last lines, which are there only to trap the issue:

        else: 
            wdebug = 1

If you do nothing else but add these two lines, set a breakpoint on

            wdebug = 1

and then run your program, you will see the issue happening for your example.
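The condition in that `if` can also be checked without a debugger. The sketch below copies the four comparisons into a hypothetical helper named `touches` and feeds it the x values from this example. It assumes an intersection tolerance of 3, which is my understanding of pdfplumber's default `intersection_tolerance` setting; if your settings differ, substitute your own value.

```python
def touches(v, h, x_tolerance, y_tolerance):
    """Same four comparisons as the `if` inside edges_to_intersections."""
    return (
        v["top"] <= h["top"] + y_tolerance
        and v["bottom"] >= h["top"] - y_tolerance
        and v["x0"] >= h["x0"] - x_tolerance
        and v["x0"] <= h["x1"] + x_tolerance
    )


def vertical_at(x):
    """Build a vertical edge at position x, spanning the table height."""
    return {"x0": x, "top": -0.021, "bottom": 211.329}


# One text-derived horizontal edge from the example.
h_edge = {"x0": 26.25, "x1": 791.228059825, "top": 37.558197654}

TOLERANCE = 3  # assumed default; not confirmed against the installed version

results = {
    x: touches(vertical_at(x), h_edge, TOLERANCE, TOLERANCE)
    for x in (760.15, 794.0, 794.3, 818.9811199999999)
}
```

Under that assumption, a vertical line only intersects the horizontal edges if its x is at most 791.228059825 + 3 = 794.228059825. That is consistent with 794 working and 818.9811199999999 failing, but treat it as a plausible explanation rather than a confirmed one.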

Upvotes: 1

K J

Reputation: 11857

Without running PDFPlumber, I can only say that any other method has no problem with the file, as it cleanly has a table structure from Abbyy Finereader OCR.

Even the Playground sees the columns when the file is run as text text.

There are a few oddities, such as two image backgrounds and a page size that is larger than Euro A4 (for example, the green image), and such disturbances may be throwing PDFPlumber off track?

Here is the text as a table overlay extraction from MS Word and a CSV export from a PDF reader.

So the PDFPlumber demo does see the full width, but for me, being just a "demo", it only shows/exports the first table line!

Upvotes: 1
