Reputation: 108
I am parsing html files and replacing specific links with new tags.
Python Code:
from bs4 import BeautifulSoup
sample='''<a href="{Image src='https://google.com' link='https://google.com'}" >{Image src='https://google.com' link='google.com'}</a>'''
soup=BeautifulSoup(sample)
for a in soup.findAll('a'):
x=BeautifulSoup('<ac:image><ri:attachment ri:filename="somefile"/> </ac:image>')
a=a.replace_with(x)
print(soup)
Actual Output:
<ac:image><ri:attachment ri:filename="somefile"></ri:attachment> </ac:image>
Desired Output:
<ac:image><ri:attachment ri:filename="somefile" /></ac:image>
The self closing Tags are automatically getting converted. The destination strictly needs self closing tags.
Any help would be appreciated!
Upvotes: 1
Views: 1063
Reputation: 195428
To get correct self closing tags, use parser xml
when creating new soup that is going to replace old tag.
Also, to preserve ac
and ri
namespaces, xml
parser requires to define xmlns:ac
and xmlns:ri
parameters. We define these parameters in a dummy tag that is removed after processing.
For example:
from bs4 import BeautifulSoup
import xml
txt = '''
<div class="my-class">
<a src="some address">
<img src="attlasian_logo.gif" />
</a>
</div>
<div class="my-class">
<a src="some address2">
<img src="other_logo.gif" />
</a>
</div>
'''
template = '''
<div class="_remove_me" xmlns:ac="http://namespace1/" xmlns:ri="http://namespace2/">
<ac:image>
<ri:attachment ri:filename="{img_src}" />
</ac:image>
</div>
'''
soup = BeautifulSoup(txt, 'html.parser')
for a in soup.select('a'):
a=a.replace_with(BeautifulSoup(template.format(img_src=a.img['src']), 'xml')) # <-- select `xml` parser, the template needs to have xmlns:* parameters to preserve namespaces
for div in soup.select('div._remove_me'):
dump=div.unwrap()
print(soup.prettify())
Prints:
<div class="my-class">
<ac:image>
<ri:attachment ri:filename="attlasian_logo.gif"/>
</ac:image>
</div>
<div class="my-class">
<ac:image>
<ri:attachment ri:filename="other_logo.gif"/>
</ac:image>
</div>
Upvotes: 1