Reputation: 1175
I'm trying to wrap the contents of a tag with BeautifulSoup. This:
<div class="footnotes">
<p>Footnote 1</p>
<p>Footnote 2</p>
</div>
should become this:
<div class="footnotes">
<ol>
<p>Footnote 1</p>
<p>Footnote 2</p>
</ol>
</div>
So I use the following code:
footnotes = soup.findAll("div", { "class" : "footnotes" })
footnotes_contents = ''
new_ol = soup.new_tag("ol")
for content in footnotes[0].children:
    new_tag = soup.new_tag(content)
    new_ol.append(new_tag)
footnotes[0].clear()
footnotes[0].append(new_ol)
print(footnotes[0])
but I get the following:
<div class="footnotes"><ol><
></
><<p>Footnote 1</p>></<p>Footnote 1</p>><
></
><<p>Footnote 2</p>></<p>Footnote 2</p>><
></
></ol></div>
Suggestions?
Upvotes: 2
Views: 3406
Reputation: 879143
Using lxml:
import lxml.html as LH
import lxml.builder as builder
E = builder.E
doc = LH.parse('data')
footnote = doc.find('.//div[@class="footnotes"]')
ol = E.ol()
for tag in footnote:
    ol.append(tag)
footnote.append(ol)
print(LH.tostring(doc.getroot()))
prints
<html><body><div class="footnotes">
<ol><p>Footnote 1</p>
<p>Footnote 2</p>
</ol></div></body></html>
Note that with lxml, an Element (tag) can be in only one place in the tree (since every Element has only one parent), so appending tag to ol also removes it from footnote. So unlike with BeautifulSoup, you do not need to iterate over the contents in reverse order, nor use insert(0,...). You just append in order.
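To make the move semantics concrete, here is a minimal sketch. It assumes the markup from the question is parsed from an inline string rather than from the 'data' file, and the variable names are only illustrative:

```python
import lxml.html as LH
from lxml.builder import E

# Assumption: the question's markup, inlined instead of read from 'data'.
div = LH.fromstring(
    '<div class="footnotes"><p>Footnote 1</p><p>Footnote 2</p></div>'
)
ol = E.ol()

first = div[0]
ol.append(first)  # moves the element; no explicit removal is needed

# The <p> now has <ol> as its parent and is gone from the <div>.
print(first.getparent().tag)
print(len(div), len(ol))
```

After the append, the div holds only the second paragraph, which is why a plain forward loop is enough to move every child.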
Using BeautifulSoup:
import bs4 as bs
with open('data', 'r') as f:
    soup = bs.BeautifulSoup(f)
footnote = soup.find("div", { "class" : "footnotes" })
new_ol = soup.new_tag("ol")
for content in reversed(footnote.contents):
    new_ol.insert(0, content.extract())
footnote.append(new_ol)
print(soup)
prints
<html><body><div class="footnotes"><ol>
<p>Footnote 1</p>
<p>Footnote 2</p>
</ol></div></body></html>
Upvotes: 5
Reputation: 1121406
Just move the .contents of your tag over using tag.extract(); don't try to create them anew with soup.new_tag (which only takes a tag name, not a whole tag object). Don't call .clear() on the original tag; .extract() already removed the elements.
Move items over in reverse as the contents are being modified in-place, leading to skipped elements if you don't watch out.
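The skipping is easy to reproduce. The sketch below deliberately iterates forward over the live list to show what goes wrong; it assumes the question's markup and the stdlib "html.parser" backend, neither of which is prescribed above:

```python
from bs4 import BeautifulSoup

html = '<div class="footnotes">\n<p>Footnote 1</p>\n<p>Footnote 2</p>\n</div>'
soup = BeautifulSoup(html, "html.parser")  # parser choice is an assumption
footnotes = soup.find("div", {"class": "footnotes"})
new_ol = soup.new_tag("ol")

# Anti-pattern: each extract() shifts the remaining children left,
# so the list iterator steps over every other child.
for content in footnotes.contents:
    new_ol.append(content.extract())

moved = len(new_ol.contents)
left_behind = footnotes.find_all("p")
print(moved, len(left_behind))
```

Here only the whitespace strings between the tags get moved, and both paragraphs stay in the div.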
Finally, use .find() when you only need to do this for one tag.
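As a quick illustration of the difference, assuming a small inline document and a hypothetical "endnotes" class that does not occur in it:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<div class="footnotes"><p>Footnote 1</p></div>', "html.parser"
)

# find_all() returns a list-like result, even for a single match.
matches = soup.find_all("div", {"class": "footnotes"})

# find() returns the first matching tag directly ...
first = soup.find("div", {"class": "footnotes"})

# ... or None when nothing matches ("endnotes" is a made-up class).
missing = soup.find("div", {"class": "endnotes"})

print(len(matches), first.name, missing)
```

Since find() can return None, it is worth checking the result before modifying it.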
You do need to either iterate over a copy of the contents list or walk it in reverse, as it'll be modified in place:
footnotes = soup.find("div", { "class" : "footnotes" })
new_ol = soup.new_tag("ol")
for content in reversed(footnotes.contents):
    new_ol.insert(0, content.extract())
footnotes.append(new_ol)
Demo:
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('''\
... <div class="footnotes">
... <p>Footnote 1</p>
... <p>Footnote 2</p>
... </div>
... ''')
>>> footnotes = soup.find("div", { "class" : "footnotes" })
>>> new_ol = soup.new_tag("ol")
>>> for content in reversed(footnotes.contents):
...     new_ol.insert(0, content.extract())
...
>>> footnotes.append(new_ol)
>>> print(footnotes)
<div class="footnotes"><ol>
<p>Footnote 1</p>
<p>Footnote 2</p>
</ol></div>
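If you need this in more than one place, the same idea can be packaged as a small function. This is only a sketch: wrap_contents is a hypothetical helper name, not part of BeautifulSoup, and it iterates over a copy of the children rather than in reverse:

```python
from bs4 import BeautifulSoup


def wrap_contents(soup, tag, wrapper_name):
    """Move all children of `tag` into a new `wrapper_name` element."""
    wrapper = soup.new_tag(wrapper_name)
    # list(...) takes a snapshot, so moving children cannot disturb the loop.
    for child in list(tag.contents):
        wrapper.append(child.extract())
    tag.append(wrapper)
    return wrapper


html = '<div class="footnotes">\n<p>Footnote 1</p>\n<p>Footnote 2</p>\n</div>'
soup = BeautifulSoup(html, "html.parser")  # parser choice is an assumption
div = soup.find("div", {"class": "footnotes"})
wrap_contents(soup, div, "ol")
print(div)
```

Iterating over a copy keeps the children in their original order with plain append(), so insert(0, ...) is not needed.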
Upvotes: 4