Reputation: 1096
I need to convert hundreds of HTML sentences generated by an outside source to readable text, and I have a question about converting the abbr tag. Below is an example:
from bs4 import BeautifulSoup
text = "<abbr title=\"World Health Organization\" style=\"color:blue\">WHO</abbr> is a specialized agency of the <abbr title=\"United Nations\" style=\"color:#CCCC00\">UN</abbr>."
print(BeautifulSoup(text, 'html.parser').get_text())
This code returns "WHO is a specialized agency of the UN.". However, what I want is "WHO (World Health Organization) is a specialized agency of the UN (United Nations)." Is there a way to do this? Maybe another module rather than BeautifulSoup?
Upvotes: 4
Views: 389
Reputation: 27723
Probably, with one of the worst algorithms in the history of algorithms:
import re
from bs4 import BeautifulSoup
text = "<abbr title=\"World Health Organization\" style=\"color:blue\">WHO</abbr> is a specialized agency of the <abbr title=\"United Nations\" style=\"color:#CCCC00\">UN</abbr>."
soup = BeautifulSoup(text, 'html.parser')
inside_abbrs = soup.find_all('abbr')
string_out = ''
for i in inside_abbrs:
    s = BeautifulSoup(str(i), 'html.parser')
    t = s.find('abbr').attrs['title']
    split_soup = re.findall(r"[\w]+|[.,!?;]", soup.text)
    bind_caps = ''.join(re.findall(r'[A-Z]', t))
    for word in split_soup:
        if word == bind_caps:
            string_out += word + " (" + t + ") "
            break
        else:
            string_out += word + " "
string_out = string_out.strip()
string_out += '.'
print(string_out)
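Output (the first word is repeated because the inner loop rescans the sentence from the start for every abbr tag):
WHO (World Health Organization) WHO is a specialized agency of the UN (United Nations).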
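A possible variant that avoids the repeated "WHO" is to edit the parse tree directly instead of scanning the sentence word by word. The sketch below is only an illustration: `expand_abbr` is a hypothetical helper name, and it assumes that an abbr tag without a title attribute should simply keep its text.

```python
from bs4 import BeautifulSoup

text = '<abbr title="World Health Organization" style="color:blue">WHO</abbr> is a specialized agency of the <abbr title="United Nations" style="color:#CCCC00">UN</abbr>.'

def expand_abbr(html):
    """Return the text of html with each abbr followed by its title in parentheses."""
    soup = BeautifulSoup(html, 'html.parser')
    for abbr in soup.find_all('abbr'):
        title = abbr.get('title')  # None when the tag has no title attribute
        if title:
            # Swap the whole tag for a plain string, e.g. "UN (United Nations)".
            abbr.replace_with(f'{abbr.get_text()} ({title})')
    return soup.get_text()

print(expand_abbr(text))
```

Because the replacement happens per tag, this does not depend on the abbreviation matching the capital letters of its title.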
Upvotes: 0
Reputation: 71451
You can iterate over the elements in soup.contents:
from bs4 import BeautifulSoup as soup
text = "<abbr title=\"World Health Organization\" style=\"color:blue\">WHO</abbr> is a specialized agency of the <abbr title=\"United Nations\" style=\"color:#CCCC00\">UN</abbr>."
d = ''.join(str(i) if i.name is None else f'{i.text} ({i["title"]})' for i in soup(text, 'html.parser').contents)
Output:
'WHO (World Health Organization) is a specialized agency of the UN (United Nations).'
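On the question of another module: if installing bs4 is not an option, roughly the same expansion can be sketched with the standard library's html.parser. This is a minimal sketch rather than a drop-in replacement. `AbbrExpander` and `expand_abbr_stdlib` are hypothetical names, and it assumes the input is reasonably well-formed, with every abbr tag closed.

```python
from html.parser import HTMLParser

text = '<abbr title="World Health Organization" style="color:blue">WHO</abbr> is a specialized agency of the <abbr title="United Nations" style="color:#CCCC00">UN</abbr>.'

class AbbrExpander(HTMLParser):
    """Collect text and append each abbr's title in parentheses after it."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []        # pieces of the output text
        self.open_titles = []  # titles of the abbr tags that are currently open

    def handle_starttag(self, tag, attrs):
        if tag == 'abbr':
            # attrs is a list of (name, value) pairs; the title may be absent.
            self.open_titles.append(dict(attrs).get('title'))

    def handle_endtag(self, tag):
        if tag == 'abbr' and self.open_titles:
            title = self.open_titles.pop()
            if title:
                self.parts.append(f' ({title})')

    def handle_data(self, data):
        self.parts.append(data)

def expand_abbr_stdlib(html):
    parser = AbbrExpander()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return ''.join(parser.parts)

print(expand_abbr_stdlib(text))
```

Unlike iterating over the top-level contents, this also picks up abbr tags nested inside other elements, since the parser reports every start and end tag it encounters.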
Upvotes: 1