Noob

Reputation: 117

How can I remove everything after a specific text in HTML, using Python and beautifulsoup4?

I'm trying to scrape Wikipedia. I want to keep only the desired data and discard everything unnecessary, such as See also, References, etc.

<h2>
     <span class="mw-headline" id="See_also">See also</span>
</h2>
<ul>
     <li><a href="/wiki/List_of_adaptations_of_works_by_Stephen_King" title="List of adaptations of works by Stephen King">List of adaptations of works by Stephen King</a></li>
     <li><a href="/wiki/Castle_Rock_(Stephen_King)" title="Castle Rock (Stephen King)">Castle Rock (Stephen King)</a></li>
     <li><a href="/wiki/Charles_Scribner%27s_Sons" title="Charles Scribner&#39;s Sons">Charles Scribner's Sons</a> (aka Scribner)</li>
     <li><a href="/wiki/Derry_(Stephen_King)" title="Derry (Stephen King)">Derry (Stephen King)</a></li>
     <li><a href="/wiki/Dollar_Baby" title="Dollar Baby">Dollar Baby</a></li>
     <li><a href="/wiki/Jerusalem%27s_Lot_(Stephen_King)" title="Jerusalem&#39;s Lot (Stephen King)">Jerusalem's Lot (Stephen King)</a></li>
     <li><i><a href="/wiki/Haven_(TV_series)" title="Haven (TV series)">Haven</a></i></li>
</ul>

As shown in the HTML above, if I find "See also" in an h2 tag, I want to delete everything that follows it (the unordered list in this case).

Upvotes: 1

Views: 312

Answers (1)

Andrej Kesely

Reputation: 195468

You can use a CSS selector with the ~ (subsequent-sibling) combinator to select the elements to extract:

from bs4 import BeautifulSoup

txt = '''
<div>This I want to keep</div>
<h2>
     <span class="mw-headline" id="See_also">See also</span>
</h2>
<ul>
     <li><a href="/wiki/List_of_adaptations_of_works_by_Stephen_King" title="List of adaptations of works by Stephen King">List of adaptations of works by Stephen King</a></li>
     <li><a href="/wiki/Castle_Rock_(Stephen_King)" title="Castle Rock (Stephen King)">Castle Rock (Stephen King)</a></li>
     <li><a href="/wiki/Charles_Scribner%27s_Sons" title="Charles Scribner&#39;s Sons">Charles Scribner's Sons</a> (aka Scribner)</li>
     <li><a href="/wiki/Derry_(Stephen_King)" title="Derry (Stephen King)">Derry (Stephen King)</a></li>
     <li><a href="/wiki/Dollar_Baby" title="Dollar Baby">Dollar Baby</a></li>
     <li><a href="/wiki/Jerusalem%27s_Lot_(Stephen_King)" title="Jerusalem&#39;s Lot (Stephen King)">Jerusalem's Lot (Stephen King)</a></li>
     <li><i><a href="/wiki/Haven_(TV_series)" title="Haven (TV series)">Haven</a></i></li>
</ul>
'''

soup = BeautifulSoup(txt, 'html.parser')

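# Match every sibling that follows the "See also" h2, plus the h2 itself,
# and detach each matched tag from the tree.
for tag in soup.select('h2:contains("See also") ~ *, h2:contains("See also")'):
    tag.extract()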

print(soup)

Prints:

<div>This I want to keep</div>

NOTE: Newer versions of bs4 (through soupsieve 2.1 and later) use :-soup-contains() instead of :contains(), which is deprecated.
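
Because the pseudo-class name depends on the installed version, a selector-free variant may be handy. The sketch below is one possible alternative, not part of the original approach: it finds the heading by its text and walks its following siblings directly. The function name remove_from_heading and the sample HTML are made up for illustration, and it assumes the h2 and the content to drop are siblings under the same parent, as in the snippet from the question.

```python
from bs4 import BeautifulSoup


def remove_from_heading(html, heading_text):
    """Return a soup with the matching h2 and everything after it removed.

    Hypothetical helper: assumes the h2 and the unwanted content share
    the same parent element.
    """
    soup = BeautifulSoup(html, "html.parser")
    for h2 in soup.find_all("h2"):
        # get_text(strip=True) ignores the whitespace around the inner span
        if h2.get_text(strip=True) == heading_text:
            # Copy the siblings into a list first, since extract()
            # modifies the tree while we iterate over it.
            for node in list(h2.next_siblings):
                node.extract()  # works for tags and bare text nodes alike
            h2.extract()
            break
    return soup


sample = (
    "<div>This I want to keep</div>"
    '<h2><span class="mw-headline" id="See_also">See also</span></h2>'
    "<ul><li>Dollar Baby</li></ul>"
    "<p>Trailing paragraph</p>"
)
cleaned = remove_from_heading(sample, "See also")
print(cleaned)
```

Iterating over next_siblings rather than find_next_siblings() also removes loose text between tags, which the tag-only lookup would leave behind.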

Upvotes: 2
