Trgovec

Reputation: 575

BeautifulSoup: extract sentences that contain a keyword

I would like to process an HTML page (e.g. this one: http://www.uni-bremen.de/mscmarbiol/) and save each sentence that contains the string 'research'.

This is just an example of the code I used to pull all the text from the page.

from bs4 import BeautifulSoup
from zipfile import ZipFile
import os
html_page = "example.html"  # I saved the page locally as example.html

data = []
with open(html_page, "r") as html:
    soup = BeautifulSoup(html, "lxml")
    text_group = soup.get_text()  # all text in the document as one string

print(text_group)

What would be the best way to export only the sentences that contain the word 'research'?

Is there a more elegant way than using .split with separators on the string? Can something be done with "re"?

Thank you very much for your help; I am very new to this topic.

Best regards,

Trgovec

Upvotes: 4

Views: 5064

Answers (3)

Denziloe

Reputation: 8162

Since HTML has no notion of a "sentence", it sounds like you will need a tool that splits plain text into sentences.

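The NLTK package is great for this kind of thing. You will want to do something like:

import nltk
# sent_tokenize needs NLTK's Punkt tokenizer data, fetched once with nltk.download
# (the resource is named "punkt" or "punkt_tab", depending on the NLTK version)
text = text_group  # the string returned by soup.get_text() in the question
sentences = nltk.sent_tokenize(text)
result = [sentence for sentence in sentences if "research" in sentence]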

It's not perfect (it doesn't understand, for instance, that "The M.Sc." in your document is not a separate sentence), but sentence segmentation is a deceptively complex task, and this is about as good as you'll get without custom handling.
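
If installing NLTK is not an option, the "re" module mentioned in the question can do a rough version of the same thing. This is only a minimal sketch, not a real sentence segmenter: it assumes a sentence ends at '.', '!' or '?' followed by whitespace and an uppercase letter, and the names sentences_with_keyword and sample are made up for illustration.

```python
import re


def sentences_with_keyword(text, keyword):
    """Return the sentences in text that contain keyword (case-insensitive)."""
    # Collapse runs of whitespace so line breaks inside a sentence do not matter.
    normalized = " ".join(text.split())
    # Assumption: a sentence ends at . ! or ? followed by whitespace and an
    # uppercase letter. This is a heuristic, not a full tokenizer.
    sentences = re.split(r"(?<=[.!?])\s+(?=[A-Z])", normalized)
    # Match the keyword at the start of a word, e.g. "research-orientated".
    pattern = re.compile(r"\b" + re.escape(keyword), re.IGNORECASE)
    return [s for s in sentences if pattern.search(s)]


# Hypothetical sample text standing in for the output of soup.get_text().
sample = (
    "The M.Sc. programme is strongly research-orientated. "
    "Courses are taught in Bremen.\n"
    "Field trips include research expeditions!"
)

matches = sentences_with_keyword(sample, "research")
for sentence in matches:
    print(sentence)
```

With the question's code you would pass text_group instead of sample. The uppercase lookahead keeps "The M.Sc. programme" in one piece, but an abbreviation followed by a capitalised word would still be split, so treat the result as approximate.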

Upvotes: 2

Guillaume

Reputation: 6039

Once you have your soup, you may try:

for tag in soup.descendants:
    if tag.string and 'research' in tag.string:
        print(tag.string)

Faster alternative using XPath, since you have lxml installed:

from lxml import etree
with open(html_page, "r") as html:
    tree = etree.parse(html, parser=etree.HTMLParser())
# Elements whose first text node contains 'research'
matches = [e.text for e in tree.xpath("//*[contains(text(), 'research')]")]
print(matches)

Upvotes: 2

宏杰李

Reputation: 12178

In [65]: soup.find_all(name=['p', 'li'], text=re.compile(r'research'))
Out[65]: 
[<p class="bodytext">The M.Sc. programme Marine Biology is strongly research-orientated. The graduates are trained to develop hypotheses-driven research concepts and to design appropriate experimental approaches in order to answer profound questions related to the large field of marine ecosystem and organism functioning and of potential impacts of local, regional and global environmental change. 
 </p>,
 <p class="bodytext">Many courses are actually taught in the laboratories and facilities of the institutes benefiting from cutting-edge research infrastructure and first-hand contact to leading experts. This unique context sets the scene for direct links from current state of research to academic training.</p>,
 <li>Training in state-of-the-art methodologies by leading research teams.</li>,
 <li>Advanced courses in different university departments and associated research institutions.</li>,
 <li>Field trips, excursions or even the opportunity to participate in research expeditions. </li>,
 <p class="bodytext">The University of Bremen and the associated research institutions offer a variety of opportunities to continue an academic career as Ph.D. candidate.
 </p>,
 <p class="bodytext">Employment opportunities for Marine Biologists exist worldwide at institutions committed to research and development, in the fishing and aquaculture industry as well as in the environmental conservation and management sector at governmental agencies or within NGOs and IGOs. Marine biologists also work at museums, zoological gardens, and aquaria. Additional employment opportunities for marine biologists include adjacent fields such as media (i.e. scientific journalism), eco-consulting, environmental impact assessments, and eco-tourism business. Marine biologists are also employed in the commercial and industrial sector, for instance for "Blue Biotechnology", coastal zone management and the sustainable use of marine resources.</p>]

Upvotes: 0
