passsky

Reputation: 301

Using BeautifulSoup to grab all the HTML between two tags

I have some HTML that looks like this:

<h1>Title</h1>

//a random amount of p/uls or tagless text

<h1> Next Title</h1>

I want to copy all of the HTML from the first h1 up to the next h1. How can I do this?

Upvotes: 30

Views: 11898

Answers (4)

AMC

Reputation: 2702

Here is a complete, up-to-date solution. It copies everything between the two h1 tags and inserts the copies after the second h1:

Contents of temp.html:

<h1>Title</h1>
<p>hi</p>
//a random amount of p/uls or tagless text
<h1> Next Title</h1>

Code:

import copy

from bs4 import BeautifulSoup

with open("resources/temp.html") as file_in:
    soup = BeautifulSoup(file_in, "lxml")

print(f"Before:\n{soup.prettify()}")

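first_header = soup.find("body").find("h1")

# Copies of every node found between the first h1 and the next one.
siblings_to_add = []

for curr_sibling in first_header.next_siblings:
    if curr_sibling.name == "h1":
        # Reached the next h1. Each insert_after call places its node directly
        # after the h1, so the copies end up in reverse order.
        for curr_sibling_to_add in siblings_to_add:
            curr_sibling.insert_after(curr_sibling_to_add)
        break
    else:
        # copy.copy gives a detached duplicate, leaving the original in place.
        siblings_to_add.append(copy.copy(curr_sibling))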

print(f"\nAfter:\n{soup.prettify()}")

Output:

Before:
<html>
 <body>
  <h1>
   Title
  </h1>
  <p>
   hi
  </p>
  //a random amount of p/uls or tagless text
  <h1>
   Next Title
  </h1>
 </body>
</html>

After:
<html>
 <body>
  <h1>
   Title
  </h1>
  <p>
   hi
  </p>
  //a random amount of p/uls or tagless text
  <h1>
   Next Title
  </h1>
  //a random amount of p/uls or tagless text
  <p>
   hi
  </p>
 </body>
</html>

Upvotes: 0

Hunting

Reputation: 381

This is the clear BeautifulSoup way, when the second h1 tag is a sibling of the first:

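# Collect the markup of every sibling after the first h1, stopping at the next h1.
html = ""
for tag in soup.find("h1").next_siblings:
    if tag.name == "h1":
        break
    else:
        html += str(tag)  # str() on Python 3; Python 2 code would use unicode()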
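
The loop above can be wrapped into a self-contained function. This is only a sketch: the function name html_after_first, the sample markup, and the choice of the built-in "html.parser" are assumptions for illustration, and like the loop above it only works when the two headings are siblings.

```python
from bs4 import BeautifulSoup, Tag


def html_after_first(markup: str, tag_name: str = "h1") -> str:
    """Return the markup between the first `tag_name` element and the
    next sibling element with the same name (hypothetical helper)."""
    soup = BeautifulSoup(markup, "html.parser")
    start = soup.find(tag_name)
    if start is None:
        return ""  # no such tag in the document

    parts = []
    for node in start.next_siblings:
        # Text nodes are not Tag instances, so only a real tag can end the loop.
        if isinstance(node, Tag) and node.name == tag_name:
            break
        parts.append(str(node))
    return "".join(parts)


sample = (
    "<h1>Title</h1><p>hi</p>loose text<ul><li>a</li></ul>"
    "<h1>Next Title</h1><p>later</p>"
)
print(html_after_first(sample))
```

If the first heading itself is wanted in the result, prepending str(start) to the returned string should be enough.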

Upvotes: 22

maltman

Reputation: 371

I have the same problem. Not sure if there is a better solution, but what I've done is use regular expressions to get the indices of the two nodes that I'm looking for. Once I have those, I extract the HTML between the two indices and create a new BeautifulSoup object from it.

Example:

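import re
from bs4 import BeautifulSoup

# Match lazily from the first heading up to the opening of the next <h1>.
m = re.search(r'<h1>Title</h1>.*?<h1>', html, re.DOTALL)
s = m.start()
e = m.end() - len('<h1>')  # stop just before the next <h1>
target_html = html[s:e]
new_bs = BeautifulSoup(target_html, 'html.parser')  # name the parser explicitly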
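
The same idea can be made a little more defensive. The sketch below uses only the standard library; the function name section_from and the sample string are made up for illustration. It escapes the heading text, uses a lookahead so the next h1 is not consumed, and returns None instead of raising when the heading is missing. It still treats HTML as plain text, so it assumes the heading markup appears in the source exactly as given.

```python
import re
from typing import Optional


def section_from(html: str, heading: str) -> Optional[str]:
    """Return the text from `heading` up to, but not including, the next
    <h1 ...> tag or the end of the string. Returns None if not found."""
    # re.escape keeps characters such as '?' or '(' in the heading literal.
    pattern = re.escape(heading) + r".*?(?=<h1\b|\Z)"
    match = re.search(pattern, html, re.DOTALL)
    return match.group(0) if match else None


page = "<h1>Title</h1>\n<p>hi</p>\ntext\n<h1> Next Title</h1>\n<p>x</p>"
print(section_from(page, "<h1>Title</h1>"))
```

The returned string can then be passed to BeautifulSoup as in the snippet above.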

Upvotes: 5

fmalina

Reputation: 6310

Interesting question. There is no way you can use just the DOM to select it. You'll have to loop through all elements up to and including the first h1 and put them into intro = str(intro), then get everything up to the second h1 into chapter1. Then remove the intro from chapter1 using

chapter = chapter1.replace(intro, '')
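
A rough sketch of how those steps might look in code. It assumes both h1 tags are top-level nodes of the parsed document; the function name first_chapter and the sample markup are invented for illustration and are not part of the original answer.

```python
from bs4 import BeautifulSoup, Tag


def first_chapter(markup: str) -> str:
    """Build `intro` and `chapter1` as strings, then strip the intro
    from chapter1 (hypothetical helper, top-level h1 tags assumed)."""
    soup = BeautifulSoup(markup, "html.parser")
    intro, chapter1 = [], []
    h1_seen = 0
    for node in soup.contents:
        is_h1 = isinstance(node, Tag) and node.name == "h1"
        if is_h1:
            h1_seen += 1
            if h1_seen == 2:
                break  # chapter1 stops just before the second h1
        chapter1.append(str(node))
        # intro is everything up to and including the first h1.
        if h1_seen == 0 or (is_h1 and h1_seen == 1):
            intro.append(str(node))

    intro_html = "".join(intro)
    chapter1_html = "".join(chapter1)
    # Remove only the leading occurrence of the intro.
    return chapter1_html.replace(intro_html, "", 1)


doc = "<p>preface</p><h1>Title</h1><p>hi</p>text<h1>Next Title</h1>"
print(first_chapter(doc))
```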

Upvotes: 0
