Reputation: 217
def get_description(link):
redditFile = urllib2.urlopen(link)
redditHtml = redditFile.read()
redditFile.close()
soup = BeautifulSoup(redditHtml)
desc = soup.find('div', attrs={'class': 'op_gd14 FL'}).text
return desc
This is the code which gives me text from this html
<div class="op_gd14 FL">
<p><span class="bigT">P</span>restige Estates Projects Ltd has informed BSE that the 18th Annual General Meeting (AGM) of the Company will be held on September 30, 2015.Source : BSE<br><br>
<a href="../../company-notices/nestleindia/notices/PEP02">Read all announcements in Prestige Estate</a> </p><p> </p>
</div>
This result is fine for me, I just want to exclude the content of
<a href="../../company-notices/nestleindia/notices/PEP02">Read all announcements in Prestige Estate</a>
from result, that is desc
in my script, if it is present and Ignore if it is not present. How can I do this?
Upvotes: 1
Views: 27
Reputation: 626932
You can use extract()
to remove unnecessary tags from the find()
result:
descItem = soup.find('div', attrs={'class': 'op_gd14 FL'}) # get the DIV
[s.extract() for s in descItem('a')] # remove <a> tags
return descItem.get_text() # return the text
Upvotes: 2
Reputation: 926
Just make some changes to last line and add re module
...
return re.sub(r'<a(.*)</a>','',desc)
Output:
'<div class="op_gd14 FL">\n <p><span class="bigT">P</span>restige Estates Projects Ltd has informed BSE that the 18th Annual General Meeting (AGM) of the Company will be held on September 30, 2015.Source : BSE<br><br> \n </p><p>
Upvotes: 1