Reputation: 161
<span class = 'ocrx_word' id = 'word_1_45' title = 'bbox 369 429 301 123;x_wconf 96'>refrence</span>
How can I extract only the values 369 429 301 123 from the markup above using Python?
Upvotes: 0
Views: 206
Reputation: 11525
from bs4 import BeautifulSoup
import re
data = """<span class = 'ocrx_word' id = 'word_1_45' title = 'bbox 369 429 301 123;x_wconf 96'>refrence</span>
"""
soup = BeautifulSoup(data, 'html.parser')
new = soup.find("span", {'class': 'ocrx_word'}).get("title")
print(re.findall(r"(?<=bbox )(?:\d+ ){3}\d+", new))
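The `re.findall` call above returns a list holding one string, `'369 429 301 123'`. If the coordinates are needed as numbers, something like the following might work. It is only a sketch: `parse_bbox` and `BBOX_RE` are hypothetical names, not part of the answer above, and it assumes the title always writes the box as `bbox` followed by four non-negative integers.

```python
import re

# Hypothetical helper (an assumption, not from the original answer):
# captures the four numbers that follow "bbox " in a title string.
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")

def parse_bbox(title):
    """Return the bbox values as a tuple of ints, or None if absent."""
    match = BBOX_RE.search(title)
    if match is None:
        return None
    return tuple(int(n) for n in match.groups())

title = "bbox 369 429 301 123;x_wconf 96"
print(parse_bbox(title))  # (369, 429, 301, 123)
```

Returning `None` for a missing box lets the caller skip spans that carry no `bbox` entry instead of raising.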
Upvotes: 1
Reputation: 16172
The simplest approach is probably to split the title on the semicolon and keep the part before it, then split that part on whitespace and keep only the numeric tokens.
from bs4 import BeautifulSoup
tag = "<span class = 'ocrx_word' id = 'word_1_45' title = 'bbox 369 429 301 123;x_wconf 96'>refrence</span>"
soup = BeautifulSoup(tag, 'html.parser')
spans = soup.find_all('span')
for span in spans:
    # Keep the part before ';' ("bbox 369 429 301 123"), then only the digit tokens.
    print([x for x in span.attrs['title'].split(';')[0].split() if x.isdigit()])
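If the `bbox` entry is not guaranteed to come first in the title, the same splitting idea can be extended to read every `;`-separated entry into a dictionary. This is a rough sketch working on the title string alone; `parse_title` is a hypothetical name, and it assumes each entry is a name followed by whitespace-separated values.

```python
def parse_title(title):
    """Map each property name in the title to its list of value strings."""
    props = {}
    for part in title.split(";"):
        tokens = part.split()
        if tokens:  # skip empty segments, e.g. after a trailing ';'
            props[tokens[0]] = tokens[1:]
    return props

props = parse_title("bbox 369 429 301 123;x_wconf 96")
# Look the box up by name, so its position in the title does not matter.
bbox = [int(v) for v in props.get("bbox", [])]
print(bbox)  # [369, 429, 301, 123]
```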
Upvotes: 1