Kirkman14

Reputation: 1686

How do I remove almost-duplicate integers from a list?

I'm parsing some PDFs in Python. These PDFs are visually organized into rows and columns. The pdftohtml script converts these PDFs to an XML format, full of loose <text> tags which don't have any hierarchy. My code then needs to sort these <text> tags back into rows.

Since each <text> tag has attributes like "top" or "left" coordinates, I wrote code to append <text> items with the same "top" coordinate to a list. This list is effectively one row.

My code first iterates over the page, finds all unique "top" values, and appends them to a tops list. Then it iterates over this tops list. For each unique top value, it searches for all items that have that "top" value and adds them to a row list.

for side in page:
    tops = list( set( [ d['top'] for d in side ] ) )
    tops.sort()
    for top in tops:
        row = []
        for blob in side:
            if int(blob['top']) == int(top):
                row.append(blob)
        rows.append(row)

This code works great for the majority of the PDFs I'm parsing. But there are cases where items which are on the same row have slightly different top values, off by one or two.

I'm trying to adapt my code to become a bit fuzzier.

The comparison at the bottom seems easy enough to fix. Something like this:

        for blob in side:
            rangeLower = int(top) - 2
            rangeUpper = int(top) + 2
            thisTop = int(blob['top'])
            if rangeLower <= thisTop <= rangeUpper :
                row.append(blob)

But the list of unique top values that I create first is a problem. The code I use is

    tops = list( set( [ d['top'] for d in side ] ) )

In these edge cases, I end up with a list like:

[925, 946, 966, 995, 996, 1015, 1035]

How could I adapt that code to avoid having "995" and "996" in the list? I want to ensure I end up with just one value when integers are within 1 or 2 of each other.

Upvotes: 0

Views: 754

Answers (2)

A.J. Uppal

Reputation: 19264

@njzk2's answer works too, but this function actually shows what is going on and is easier to understand:

>>> def dedupe(values):
...     values.sort()  # sort in place, ascending
...     # Walk the indices from the end down to 1 so that deleting an
...     # element never shifts an index we still have to visit.
...     for k in range(len(values) - 1, 0, -1):
...         if values[k] - 1 == values[k - 1]:  # previous value is exactly one less
...             del values[k - 1]               # remove the smaller of the pair
...     return values
... 
>>> tops = [925, 946, 966, 995, 996, 1015, 1035]
>>> dedupe(tops)
[925, 946, 966, 996, 1015, 1035]
>>> 

Upvotes: 0

njzk2

Reputation: 39406

  • Sort the list so that close values end up next to one another
  • Use reduce to keep a value only if it is more than the threshold above the last value kept

Code:

>>> from functools import reduce  # reduce is no longer a builtin on Python 3
>>> tops = [925, 946, 966, 995, 996, 1015, 1035]
>>> threshold = 2
>>> reduce(lambda x, y: x + [y] if len(x) == 0 or y > x[-1] + threshold else x, sorted(tops), [])
[925, 946, 966, 995, 1015, 1035]

With several contiguous values:

>>> tops = range(10)
>>> reduce(lambda x, y: x + [y] if len(x) == 0 or y > x[-1] + threshold else x, sorted(tops), [])
[0, 3, 6, 9]
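
The contiguous case gives [0, 3, 6, 9] rather than a single value because each candidate is compared against the last value that was kept, not against its immediate neighbour. Here is a minimal sketch contrasting the two rules. The function names keep_far_from_last_kept and keep_far_from_previous are made up for this illustration and are not part of any library.

```python
def keep_far_from_last_kept(values, threshold):
    """Keep a value only if it is more than `threshold` above the last kept one."""
    kept = []
    for value in sorted(values):
        if not kept or value > kept[-1] + threshold:
            kept.append(value)
    return kept


def keep_far_from_previous(values, threshold):
    """Keep a value only if it is more than `threshold` above its sorted neighbour."""
    kept = []
    previous = None
    for value in sorted(values):
        if previous is None or value > previous + threshold:
            kept.append(value)
        previous = value  # always advance, whether or not the value was kept
    return kept


print(keep_far_from_last_kept(range(10), 2))  # [0, 3, 6, 9]
print(keep_far_from_previous(range(10), 2))   # [0]
```

The first rule matches the reduce call above. The second would merge a long run of slowly drifting values into one, which may or may not be what you want for row detection.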

Edit

Reduce can be a little cumbersome to read, so here is a more straightforward approach:

res = []
for item in sorted(tops):
    # keep the item only if it is more than `threshold` above the last kept value
    if not res or item > res[-1] + threshold:
        res.append(item)
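
If the end goal is the rows rather than the list of tops, the same comparison could probably be applied to the blobs directly, skipping the intermediate tops list. This is only a sketch: it assumes each blob is a dict with a 'top' key, as in the question. The function name group_into_rows and the sample data are hypothetical.

```python
def group_into_rows(blobs, threshold=2):
    """Group blobs whose 'top' lies within `threshold` of the row's first blob."""
    rows = []
    row_top = None  # 'top' of the first blob in the current row
    for blob in sorted(blobs, key=lambda b: int(b['top'])):
        top = int(blob['top'])
        if row_top is None or top > row_top + threshold:
            rows.append([])  # this blob is too far down: start a new row
            row_top = top
        rows[-1].append(blob)
    return rows


# Hypothetical sample data; real blobs would come from the parsed XML.
side = [
    {'top': '996', 'left': '300', 'text': 'b'},
    {'top': '925', 'left': '40', 'text': 'x'},
    {'top': '995', 'left': '40', 'text': 'a'},
]
rows = group_into_rows(side)
print([[b['text'] for b in row] for row in rows])  # [['x'], ['a', 'b']]
```

Each blob lands in exactly one row this way, so an item at 996 cannot be appended to both a 995 row and a 996 row. Within a row the blobs are ordered by top, so you may still want to sort each row by 'left' afterwards.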

Upvotes: 4
