Reputation: 6052
I have the following data (represented in a list in my code):
word_list = [{'bottom': Decimal('58.650'),
              'text': 'Contact'
              },
             {'bottom': Decimal('77.280'),
              'text': '[email protected]'
              },
             {'bottom': Decimal('101.833'),
              'text': 'www.domain.com'
              },
             {'bottom': Decimal('116.233'),
              'text': '(Acme INC)'
              },
             {'bottom': Decimal('74.101'),
              'text': 'Oliver'
              },
             {'bottom': Decimal('90.662'),
              'text': 'CEO'
              }]
The above data comes from a PDF text extraction. I am trying to parse it and keep the layout formatting, based on the bottom values.
The idea is to check the bottom value of the current word and then find all words whose bottom value lies within a tolerance of threshold.
This is my code:
threshold = float('10')
current_row = [word_list[0], ]
row_list = [current_row, ]
for word in word_list[1:]:
    if abs(current_row[-1]['bottom'] - word['bottom']) <= threshold:
        # distance is small, use same row
        current_row.append(word)
    else:
        # distance is big, create new row
        current_row = [word, ]
        row_list.append(current_row)
So this returns a list of rows, each holding the words that fall within the threshold.
I am a bit stuck here: while iterating the list, some words may have bottom values that are very close to each other, so the same close words get selected in multiple iterations.
For example, if a word has a bottom value close to that of a word already in row_list, it is simply added to the list again.
I was wondering whether it is possible to delete the words that have already been iterated over or added. Something like:
if abs(current_row[-1]['bottom'] - word['bottom']) <= threshold:
    [...]
else:
    [...]
del word from word_list
However, I am not sure how to implement this, as I cannot modify word_list within the loop.
Upvotes: 2
Views: 387
Reputation: 38177
You can sort the list by its bottom values first, using the key parameter of sort(), e.g.
word_list.sort(key=lambda x: x['bottom'])
The grouping loop then becomes:
word_list.sort(key=lambda x: x['bottom'])
rows = []
current = [word_list.pop(0)]  # reversing the sort and using pop() is more efficient
while word_list:
    if word_list[0]['bottom'] - current[-1]['bottom'] < threshold:
        current.append(word_list.pop(0))
    else:
        rows.append(current)
        current = [word_list.pop(0)]
rows.append(current)
The code consumes word_list until it is empty. The current word (at position 0, though reversing the sort and popping from the end would be more efficient) is compared to the last word added to the current row. The end result, printed with pprint.pprint(rows), is:
[[{'bottom': Decimal('58.650'), 'text': 'Contact'}],
 [{'bottom': Decimal('74.101'), 'text': 'Oliver'},
  {'bottom': Decimal('77.280'), 'text': '[email protected]'}],
 [{'bottom': Decimal('90.662'), 'text': 'CEO'}],
 [{'bottom': Decimal('101.833'), 'text': 'www.domain.com'}],
 [{'bottom': Decimal('116.233'), 'text': '(Acme INC)'}]]
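If you would rather not empty word_list, the same grouping can probably be written with sorted() and a plain for loop. This is only a minimal sketch: the function name group_rows and the Decimal threshold are my own assumptions, and the sample data is copied from the question.

```python
from decimal import Decimal


def group_rows(words, threshold=Decimal("10")):
    """Group words into rows; group_rows is a hypothetical helper name."""
    rows = []
    # sorted() returns a new list, so the caller's list is left untouched.
    for word in sorted(words, key=lambda w: w["bottom"]):
        # Same row if the gap to the previously added word is below the threshold.
        if rows and word["bottom"] - rows[-1][-1]["bottom"] < threshold:
            rows[-1].append(word)
        else:
            rows.append([word])
    return rows


words = [
    {"bottom": Decimal("58.650"), "text": "Contact"},
    {"bottom": Decimal("77.280"), "text": "[email protected]"},
    {"bottom": Decimal("101.833"), "text": "www.domain.com"},
    {"bottom": Decimal("116.233"), "text": "(Acme INC)"},
    {"bottom": Decimal("74.101"), "text": "Oliver"},
    {"bottom": Decimal("90.662"), "text": "CEO"},
]
rows = group_rows(words)
```

Because each word is visited exactly once, nothing can end up in two rows, and no deletion inside the loop is needed.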
Upvotes: 1
Reputation: 67
You can use a while loop instead of a for loop:
while len(word_list) > 1:
    # pop(1) removes the item once it is used, so the next item moves to index 1 automatically
    word = word_list.pop(1)
    if abs(current_row[-1]['bottom'] - word['bottom']) <= threshold:
        [...]
    else:
        [...]
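For completeness, here is one way the consuming loop could look when written out in full. It is a rough sketch, not the only way to do it: I am assuming a collections.deque as the work queue (so removing from the front is cheap), and group_consuming and sample are made-up names for illustration.

```python
from collections import deque
from decimal import Decimal


def group_consuming(words, threshold=Decimal("10")):
    """Consume a queue of words; group_consuming is a hypothetical name."""
    queue = deque(words)  # a copy, so the original list is not modified
    if not queue:
        return []
    current_row = [queue.popleft()]
    row_list = [current_row]
    while queue:
        word = queue.popleft()  # each word is removed as soon as it is used
        if abs(current_row[-1]["bottom"] - word["bottom"]) <= threshold:
            current_row.append(word)  # distance is small, use same row
        else:
            current_row = [word]  # distance is big, create new row
            row_list.append(current_row)
    return row_list


# Illustrative data only, already ordered by 'bottom'.
sample = [
    {"bottom": Decimal("10"), "text": "a"},
    {"bottom": Decimal("12"), "text": "b"},
    {"bottom": Decimal("30"), "text": "c"},
]
result = group_consuming(sample)
```

Note that the result depends on the input order, so you would most likely want to sort by bottom before grouping.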
Upvotes: 1
Reputation: 108
bottoms = []
for w in word_list:
    bottoms.append(w["bottom"])
current_row = []
row_list = []
key = sorted(bottoms)[0]
threshold = float("10")
for b in sorted(bottoms):
    if abs(b - key) <= threshold:
        idx = bottoms.index(b)
        current_row.append(word_list[idx])
    else:
        row_list.append(current_row)
        idx = bottoms.index(b)
        current_row = [word_list[idx]]
        key = b
row_list.append(current_row)  # keep the final row as well
for row in row_list:
    print(row)
This always compares each value against the lowest value of the current row (the one that started it), and the output is
[{'bottom': Decimal('58.650'), 'text': 'Contact'}]
[{'bottom': Decimal('74.101'), 'text': 'Oliver'}, {'bottom': Decimal('77.280'), 'text': '[email protected]'}]
[{'bottom': Decimal('90.662'), 'text': 'CEO'}]
[{'bottom': Decimal('101.833'), 'text': 'www.domain.com'}]
[{'bottom': Decimal('116.233'), 'text': '(Acme INC)'}]
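Comparing against the first value of a row rather than the last one added can give different rows when values form a chain. The sketch below illustrates this on made-up numbers; the function group and its compare_to_first flag are hypothetical and only exist for this comparison.

```python
from decimal import Decimal


def group(bottoms, threshold, compare_to_first):
    """Group sorted values; compare against the row's first or last entry."""
    rows = []
    for b in sorted(bottoms):
        if rows:
            reference = rows[-1][0] if compare_to_first else rows[-1][-1]
            if b - reference <= threshold:
                rows[-1].append(b)
                continue
        rows.append([b])  # no row yet, or the gap is too large
    return rows


# Illustrative values: each is 8 away from its neighbour.
values = [Decimal("10"), Decimal("18"), Decimal("26")]
anchored = group(values, Decimal("10"), compare_to_first=True)
chained = group(values, Decimal("10"), compare_to_first=False)
```

Here anchored splits into two rows (10 and 18 together, 26 alone), while chained keeps all three values in one row, since every step is within the threshold. Which behaviour fits better likely depends on how much the baseline drifts within a line of the PDF.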
Upvotes: 0