oliverbj

Reputation: 6052

Dynamically filter list and remove item in loop

I have the following data (represented in a list in my code):

word_list = [{'bottom': Decimal('58.650'),  
  'text': 'Contact'
 },
 {'bottom': Decimal('77.280'),  
  'text': '[email protected]'
 },
 {'bottom': Decimal('101.833'),
  'text': 'www.domain.com'
 },
 {'bottom': Decimal('116.233'),
  'text': '(Acme INC)'
 },
 {'bottom': Decimal('74.101'),
  'text': 'Oliver'
 },
 {'bottom': Decimal('90.662'),
  'text': 'CEO'
 }]

The data above comes from a PDF text extraction. I am trying to parse it while preserving the layout, based on the bottom values.

The idea is to check the bottom value of the current word and then find all words whose bottom value lies within a given tolerance of it (threshold, set to 10 in the code below).

This is my code:

threshold = float('10')
current_row = [word_list[0], ]
row_list = [current_row, ]

for word in word_list[1:]:

    if abs(current_row[-1]['bottom'] - word['bottom']) <= threshold:
       # distance is small, use same row
       current_row.append(word)
    else:
       # distance is big, create new row
       current_row = [word, ]
       row_list.append(current_row)

So this will return a list of the words within the approved threshold.

I am a bit stuck here: while iterating over the list, several words can have bottom values that are very close to each other, so the same nearby words get selected in multiple iterations.

For example, if a word has a bottom value close to that of a word already added to row_list, it is simply added to the list again.

Is it possible to delete the words that have already been iterated over or added? Something like:


if abs(current_row[-1]['bottom'] - word['bottom']) <= threshold:
   [...]
else:
   [...]

del word from word_list

However, I am not sure how to implement this, since I cannot modify word_list while looping over it.

Upvotes: 2

Views: 387

Answers (3)

serv-inc

Reputation: 38177

You can sort the list by the bottom value using the key parameter, e.g.

word_list.sort(key=lambda x: x['bottom'])

This results in

word_list.sort(key=lambda x: x['bottom'])
rows = []
current = [word_list.pop(0)]  # reversing the sort and using pop() is more efficient
while word_list:
    if word_list[0]['bottom'] - current[-1]['bottom'] < threshold:
        current.append(word_list.pop(0))
    else:
        rows.append(current)
        current = [word_list.pop(0)]
rows.append(current)

The code consumes word_list until it is empty. The current word (at position 0, though reversing the sort would be more efficient) is compared to the last word added to the current row. The end result, printed with pprint.pprint(rows), is:

[[{'bottom': Decimal('58.650'), 'text': 'Contact'}],
 [{'bottom': Decimal('74.101'), 'text': 'Oliver'},
  {'bottom': Decimal('77.280'), 'text': '[email protected]'}],
 [{'bottom': Decimal('90.662'), 'text': 'CEO'}],
 [{'bottom': Decimal('101.833'), 'text': 'www.domain.com'}],
 [{'bottom': Decimal('116.233'), 'text': '(Acme INC)'}]]
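
The note about reversing the sort could be sketched roughly as follows. This is only an illustration under a few assumptions: the helper name group_rows is made up, the threshold defaults to 10 as in the question, and the function works on a sorted copy so the caller's list is left untouched.

```python
from decimal import Decimal


def group_rows(words, threshold=10):
    """Group words into rows by 'bottom' (hypothetical helper for illustration)."""
    # Sort descending so the smallest 'bottom' sits at the END of the list,
    # where pop() removes it in O(1); pop(0) would shift every remaining item.
    pending = sorted(words, key=lambda w: w['bottom'], reverse=True)
    if not pending:
        return []
    rows = []
    current = [pending.pop()]
    while pending:
        word = pending.pop()
        if word['bottom'] - current[-1]['bottom'] < threshold:
            current.append(word)      # close enough: same row
        else:
            rows.append(current)      # too far: close the row, start a new one
            current = [word]
    rows.append(current)              # the last row is still open here
    return rows


word_list = [
    {'bottom': Decimal('58.650'), 'text': 'Contact'},
    {'bottom': Decimal('77.280'), 'text': '[email protected]'},
    {'bottom': Decimal('101.833'), 'text': 'www.domain.com'},
    {'bottom': Decimal('116.233'), 'text': '(Acme INC)'},
    {'bottom': Decimal('74.101'), 'text': 'Oliver'},
    {'bottom': Decimal('90.662'), 'text': 'CEO'},
]
rows = group_rows(word_list)
```

With the sample data this should produce the same five rows as above, and word_list itself keeps all six entries.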

Upvotes: 1

amal jith

Reputation: 67

You can use a while loop instead of a for loop:

while len(word_list) > 1:
    # pop(1) removes and returns the second item, so the following item moves into index 1;
    # unlike remove(word), it cannot delete an equal dict that sits earlier in the list
    word = word_list.pop(1)
    if abs(current_row[-1]['bottom'] - word['bottom']) <= threshold:
       [...]
    else:
       [...]
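
One way the elided branches might be filled in is sketched below, reusing the row logic from the question. Two things here are assumptions rather than part of this answer: the words are sorted first (borrowed from the other answer), and a collections.deque is used so that each word is removed from the front cheaply and consumed exactly once.

```python
from collections import deque
from decimal import Decimal

threshold = 10
word_list = [
    {'bottom': Decimal('58.650'), 'text': 'Contact'},
    {'bottom': Decimal('77.280'), 'text': '[email protected]'},
    {'bottom': Decimal('101.833'), 'text': 'www.domain.com'},
    {'bottom': Decimal('116.233'), 'text': '(Acme INC)'},
    {'bottom': Decimal('74.101'), 'text': 'Oliver'},
    {'bottom': Decimal('90.662'), 'text': 'CEO'},
]

# Assumption: sort by 'bottom' first so that neighbours in the queue
# are also neighbours on the page.
queue = deque(sorted(word_list, key=lambda w: w['bottom']))

current_row = [queue.popleft()]
row_list = [current_row]
while queue:
    word = queue.popleft()  # removed from the queue, so it is never seen twice
    if abs(current_row[-1]['bottom'] - word['bottom']) <= threshold:
        # distance is small, use same row
        current_row.append(word)
    else:
        # distance is big, create new row
        current_row = [word]
        row_list.append(current_row)
```

Because the queue is a separate object, the original word_list is not modified while looping, which sidesteps the problem raised in the question.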

Upvotes: 1

Robert Guggenberger

Reputation: 108

bottoms = []
for w in word_list:
    bottoms.append(w["bottom"])

current_row = []
row_list = []
key = sorted(bottoms)[0]  # lowest value of the row currently being built
threshold = float("10")
for b in sorted(bottoms):
    # index() returns the first match, so this assumes the bottom values are unique
    idx = bottoms.index(b)
    if abs(b - key) <= threshold:
        current_row.append(word_list[idx])
    else:
        row_list.append(current_row)
        current_row = [word_list[idx]]
        key = b
row_list.append(current_row)  # the last row is still open after the loop

for row in row_list:
    print(row)

This compares each value against the lowest value of the current row (the one that started it), and the output is

[{'bottom': Decimal('58.650'), 'text': 'Contact'}]
[{'bottom': Decimal('74.101'), 'text': 'Oliver'}, {'bottom': Decimal('77.280'), 'text': '[email protected]'}]
[{'bottom': Decimal('90.662'), 'text': 'CEO'}]
[{'bottom': Decimal('101.833'), 'text': 'www.domain.com'}]
[{'bottom': Decimal('116.233'), 'text': '(Acme INC)'}]
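
Comparing against the value that started the row, rather than against the previous word, can change the grouping. A small sketch with made-up numbers (both function names are hypothetical and only serve the comparison):

```python
def group_by_first(values, threshold):
    """New row when a value is too far from the FIRST value of the current row."""
    rows = []
    for v in sorted(values):
        if rows and v - rows[-1][0] <= threshold:
            rows[-1].append(v)
        else:
            rows.append([v])
    return rows


def group_by_previous(values, threshold):
    """New row when a value is too far from the PREVIOUS value in the row."""
    rows = []
    for v in sorted(values):
        if rows and v - rows[-1][-1] <= threshold:
            rows[-1].append(v)
        else:
            rows.append([v])
    return rows


bottoms = [0, 6, 12, 18]  # made-up values, each 6 apart
by_first = group_by_first(bottoms, 10)        # [[0, 6], [12, 18]]
by_previous = group_by_previous(bottoms, 10)  # [[0, 6, 12, 18]]
```

Anchoring to the first value bounds the height of a row by the threshold, while comparing to the previous value lets a chain of closely spaced words drift into one long row. For the sample data in the question both strategies happen to give the same rows.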

Upvotes: 0
