CodeDecode

Reputation: 161

How to get coordinates of characters in an HTML document?

<span class = 'ocrx_word' id = 'word_1_45' title = 'bbox 369 429 301 123;x_wconf 96'>refrence</span>

How can I extract only the values 369 429 301 123 from the code above using Python?

Upvotes: 0

Views: 206

Answers (2)

from bs4 import BeautifulSoup
import re

data = """<span class = 'ocrx_word' id = 'word_1_45' title = 'bbox 369 429 301 123;x_wconf 96'>refrence</span>
"""

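soup = BeautifulSoup(data, 'html.parser')

# The bounding box is stored in the span's title attribute.
title = soup.find("span", {'class': 'ocrx_word'}).get("title")

# Match the four space-separated integers that follow "bbox ".
print(re.findall(r"(?<=bbox )(?:\d+ ){3}\d+", title))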
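The regex above gives back a list holding one string, `['369 429 301 123']`. If you would rather have four integers, a variation with capture groups might look like the sketch below. It works on the title string only, and `bbox_from_title` is a name I made up for illustration, not something from bs4 or re.

```python
import re

# Title string as it would come back from soup.find(...).get("title").
title = "bbox 369 429 301 123;x_wconf 96"


def bbox_from_title(title):
    """Return the four bbox numbers as a tuple of ints, or None if absent.

    Hypothetical helper, assuming the values are non-negative integers
    separated by single spaces, as in the question.
    """
    match = re.search(r"bbox (\d+) (\d+) (\d+) (\d+)", title)
    if match is None:
        return None
    # Each capture group holds one number as a string; convert them all.
    return tuple(int(n) for n in match.groups())


print(bbox_from_title(title))
```

Returning None when there is no match lets the caller decide what to do with spans that carry no bbox, instead of failing on an empty result.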

Upvotes: 1

Chris

Reputation: 16172

The simplest approach is probably to split the title attribute on the semicolon and keep the part before it. Then split that part on whitespace and keep only the numeric tokens.

from bs4 import BeautifulSoup

tag = "<span class = 'ocrx_word' id = 'word_1_45' title = 'bbox 369 429 301 123;x_wconf 96'>refrence</span>"
soup = BeautifulSoup(tag, 'html.parser')
s = soup.find_all('span')

for span in s:
    # Take the part before ';', split on whitespace, keep only the digit tokens.
    print([x for x in span.attrs['title'].split(';')[0].split() if x.isdigit()])
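The same splitting idea can be pushed a little further if you also want the other values in the title, such as x_wconf. This is only a sketch that assumes the title is always a semicolon-separated list of "name value value ..." entries, as in the question. `parse_title` is a hypothetical helper name, not a bs4 function.

```python
def parse_title(title):
    """Split 'name v1 v2;name2 v3' into {'name': ['v1', 'v2'], 'name2': ['v3']}.

    Hypothetical helper; assumes each entry starts with its property name.
    """
    props = {}
    for part in title.split(";"):
        fields = part.split()
        if fields:  # skip empty entries, e.g. after a trailing ';'
            props[fields[0]] = fields[1:]
    return props


# Title string as taken from span.attrs['title'] in the question's markup.
title = "bbox 369 429 301 123;x_wconf 96"
props = parse_title(title)

# Look the values up by name and convert them to integers.
bbox = [int(v) for v in props["bbox"]]
print(bbox)
```

Looking the values up by name avoids relying on bbox being the first entry. Note that `str.isdigit()` in the one-liner above is False for strings like '-5' or '1.5', so such values would be dropped silently, whereas `int()` here would raise an error instead.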

Upvotes: 1
