How can I remove everything after a specific text present in HTML, using Python and BeautifulSoup4?

Question

I'm trying to scrape Wikipedia. I want to keep only the desired data and discard everything unnecessary, such as See also, References, etc.


    <h2>See also</h2>

    <ul>
      <li>List of adaptations of works by Stephen King</li>
      <li>Castle Rock (Stephen King)</li>
      <li>Charles Scribner's Sons (aka Scribner)</li>
      <li>Derry (Stephen King)</li>
      <li>Dollar Baby</li>
      <li>Jerusalem's Lot (Stephen King)</li>
      <li>Haven</li>
    </ul>

As shown in the HTML above, if I find "See also" in an h2 tag, I want to delete everything that follows it (the unordered list in this case).

Andrej Kesely · Accepted Answer

You can use a CSS selector with the ~ (general sibling) combinator to select the elements that follow the heading, then extract them along with the heading itself:

from bs4 import BeautifulSoup

txt = '''
This I want to keep

<h2>See also</h2>

<ul>
<li>List of adaptations of works by Stephen King</li>
<li>Castle Rock (Stephen King)</li>
<li>Charles Scribner's Sons (aka Scribner)</li>
<li>Derry (Stephen King)</li>
<li>Dollar Baby</li>
<li>Jerusalem's Lot (Stephen King)</li>
<li>Haven</li>
</ul>

'''

soup = BeautifulSoup(txt, 'html.parser')

for tag in soup.select('h2:contains("See also") ~ *, h2:contains("See also")'):
    tag.extract()

print(soup)

Prints:

This I want to keep

NOTE: In newer versions of bs4, :contains is deprecated; use :-soup-contains instead, e.g. h2:-soup-contains("See also").
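For reference, here is a minimal sketch of the same idea written with :-soup-contains, assuming a bs4 installation whose bundled soupsieve supports that pseudo-class. The helper name strip_from_heading and the shortened sample markup are hypothetical, chosen only for illustration; real Wikipedia markup may nest the heading differently.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened sample markup (an assumption, not the real page).
html = '''
This I want to keep

<h2>See also</h2>

<ul>
<li>Dollar Baby</li>
<li>Haven</li>
</ul>
'''


def strip_from_heading(markup, heading_text):
    """Remove the h2 containing heading_text and every element sibling after it.

    Assumes heading_text contains no double quotes, since it is placed
    directly inside the selector string.
    """
    soup = BeautifulSoup(markup, 'html.parser')
    heading = f'h2:-soup-contains("{heading_text}")'
    # "heading ~ *" matches later element siblings of the heading;
    # the second selector matches the heading itself.
    for tag in soup.select(f'{heading} ~ *, {heading}'):
        tag.decompose()
    return soup


cleaned = strip_from_heading(html, 'See also')
print(cleaned.get_text(strip=True))
```

Note that ~ only matches element siblings at the same nesting level as the heading. Bare text that follows the h2, or content outside the heading's parent, would not be removed by this selector.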
