cyclic

Reputation: 217

Removing particular content from a parsed result using BeautifulSoup

import urllib2                   # Python 2 standard library
from bs4 import BeautifulSoup    # assumes BeautifulSoup 4

def get_description(link):
    redditFile = urllib2.urlopen(link)
    redditHtml = redditFile.read()
    redditFile.close()
    soup = BeautifulSoup(redditHtml)
    desc = soup.find('div', attrs={'class': 'op_gd14 FL'}).text
    return desc

This code gives me the text from the following HTML:

    <div class="op_gd14 FL">
    <p><span class="bigT">P</span>restige Estates Projects Ltd has informed BSE that the 18th Annual General Meeting (AGM) of the Company will be held on September 30, 2015.Source : BSE<br><br>
    <a href="../../company-notices/nestleindia/notices/PEP02">Read all announcements in Prestige Estate</a>  </p><p> </p>
    </div>

This result is fine, but I want to exclude the content of

<a href="../../company-notices/nestleindia/notices/PEP02">Read all announcements in Prestige Estate</a>

from the result (desc in my script) when it is present, and leave the result unchanged when it is not. How can I do this?

Upvotes: 1

Views: 27

Answers (2)

Wiktor Stribiżew

Reputation: 626932

You can use extract() to remove unnecessary tags from the find() result:

descItem = soup.find('div', attrs={'class': 'op_gd14 FL'})  # get the DIV
for a in descItem.find_all('a'):                            # remove every <a> tag, if any
    a.extract()
return descItem.get_text()                                  # return the remaining text
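
For reference, here is a minimal self-contained sketch of the same idea. It assumes Python 3 with BeautifulSoup 4 (`bs4`) installed and parses a string instead of downloading a page. The names `SAMPLE_HTML` and `description_without_links` are made up for this illustration, and the sample is a trimmed copy of the markup from the question. It uses `decompose()` rather than `extract()` because the removed tags are not needed afterwards, and it returns an empty string when the div is missing, which is an assumption about the desired behaviour.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the markup from the question (illustrative only).
SAMPLE_HTML = """<div class="op_gd14 FL">
<p><span class="bigT">P</span>restige Estates Projects Ltd has informed BSE that the 18th Annual General Meeting (AGM) of the Company will be held on September 30, 2015.Source : BSE<br><br>
<a href="../../company-notices/nestleindia/notices/PEP02">Read all announcements in Prestige Estate</a></p>
</div>"""


def description_without_links(html):
    """Return the text of the description div with all <a> content removed."""
    soup = BeautifulSoup(html, "html.parser")
    div = soup.find("div", attrs={"class": "op_gd14 FL"})
    if div is None:
        # Assumption: a page without the div yields an empty description.
        return ""
    # find_all returns an empty list when there are no links,
    # so the "not present" case needs no special handling.
    for link in div.find_all("a"):
        link.decompose()
    return div.get_text().strip()


if __name__ == "__main__":
    print(description_without_links(SAMPLE_HTML))
```

Because the loop simply does nothing when no `<a>` tag exists, the same function covers both the "present" and "not present" cases from the question.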

Upvotes: 2

mmachine

Reputation: 926

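Change the last line and import the re module. Note that the pattern has to run on the div's HTML string (as in the output below), because .text has already dropped the tags:

...
return re.sub(r'<a\b.*?</a>', '', desc, flags=re.DOTALL)  # non-greedy: removes each <a>...</a> separately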

Output:

'<div class="op_gd14 FL">\n    <p><span class="bigT">P</span>restige Estates Projects Ltd has informed BSE that the 18th Annual General Meeting (AGM) of the Company will be held on September 30, 2015.Source : BSE<br><br>  \n  </p><p> 
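
One caveat with the regex route: a greedy pattern such as `<a(.*)</a>` matches from the first `<a` to the last `</a>` on the line, so any text sitting between two links is removed as well. The sketch below compares it with a non-greedy variant using only the standard library. The `html` string is a hypothetical sample with two links, not taken from the question, and regex-based HTML editing remains fragile in general.

```python
import re

# Hypothetical sample with two links and some text between them.
html = ('<p>Source : BSE<br><br> '
        '<a href="/one">First link</a> text to keep '
        '<a href="/two">Second link</a></p>')

# Greedy: .* runs to the LAST </a>, swallowing the text in between.
greedy = re.sub(r'<a(.*)</a>', '', html)

# Non-greedy: .*? stops at the NEAREST </a>; DOTALL lets a link span lines,
# and \b keeps the pattern from matching tags such as <abbr>.
lazy = re.sub(r'<a\b.*?</a>', '', html, flags=re.DOTALL)

print(greedy)
print(lazy)
```

Only the non-greedy version keeps the text between the two links.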

Upvotes: 1
