How to get coordinates of characters in html document?

<span class = 'ocrx_word' id = 'word_1_45' title = 'bbox 369 429 301 123;x_wconf 96'>refrence</span>

How can I extract only the values 369 429 301 123 from the code above using Python?

Upvotes: 0

Answers (2)

αԋɱҽԃ αмєяιcαη

Reputation: 11525

from bs4 import BeautifulSoup
import re

data = """<span class = 'ocrx_word' id = 'word_1_45' title = 'bbox 369 429 301 123;x_wconf 96'>refrence</span>
"""

soup = BeautifulSoup(data, 'html.parser')

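# Read the title attribute of the first span with class "ocrx_word".
new = soup.find("span", {'class': 'ocrx_word'}).get("title")

# Match the four space-separated integers that follow "bbox ".
# findall returns a list of strings, here ['369 429 301 123'].
print(re.findall(r"(?<=bbox )(?:\d+ ){3}\d+", new))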
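If the numbers are needed as integers rather than as one string, a variation on the same regex idea might look like the sketch below. It works on the title string directly, so it assumes the attribute has already been read (for example with `.get("title")` as above). The names `BBOX_RE` and `extract_bbox` are made up for illustration.

```python
import re

# Hypothetical helper: one capture group per bbox number.
BBOX_RE = re.compile(r"\bbbox\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)")

def extract_bbox(title):
    """Return the four bbox numbers as a tuple of ints, or None if absent."""
    match = BBOX_RE.search(title)
    if match is None:
        return None
    return tuple(int(n) for n in match.groups())

title = "bbox 369 429 301 123;x_wconf 96"
print(extract_bbox(title))  # (369, 429, 301, 123)
```

Returning None when there is no bbox lets the caller skip spans whose title carries other properties only.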

Upvotes: 1

Chris

Reputation: 16172

The simplest approach is probably to split the title on the semicolon and keep the part before it, then split that part on whitespace and keep only the numeric tokens.

from bs4 import BeautifulSoup

tag = "<span class = 'ocrx_word' id = 'word_1_45' title = 'bbox 369 429 301 123;x_wconf 96'>refrence</span>"
soup = BeautifulSoup(tag, 'html.parser')
s = soup.find_all('span')

for span in s:
    # Take the part before ';', split on whitespace, keep the numeric tokens.
    print([x for x in span.attrs['title'].split(';')[0].split() if x.isdigit()])
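The same split-based idea can be extended to keep every property in the title, not just the bbox. The following is only a sketch: `parse_title` is a hypothetical helper, and it assumes the title is a list of `name value value ...` entries separated by semicolons, as in the example span.

```python
def parse_title(title):
    """Split a title such as 'bbox 369 429 301 123;x_wconf 96'
    into a dict mapping each property name to its list of values."""
    props = {}
    for part in title.split(";"):
        fields = part.split()
        if not fields:
            continue  # skip empty entries, e.g. after a trailing ';'
        name, values = fields[0], fields[1:]
        # Convert purely numeric tokens to int, leave anything else as text.
        props[name] = [int(v) if v.isdigit() else v for v in values]
    return props

props = parse_title("bbox 369 429 301 123;x_wconf 96")
print(props["bbox"])  # [369, 429, 301, 123]
```

In the loop above, this would be called as `parse_title(span.attrs['title'])` for each span.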

Upvotes: 1
