souper

Reputation: 1

Web scraping in Python: find all elements by text instead of by tag name

Let's use the word "technology" as an example. I want to search all the text on a webpage, find every element whose text contains the word "technology", and print only the contents of those elements. Here is my attempt, which does not work. Please help me figure this out.

words = soup.body.get_text()

for word in words:
   i = word.soup.find_all("technology")
   print(i)

Upvotes: 0

Views: 2647

Answers (2)

alecxe

Reputation: 473903

You should search by text, using the text argument (renamed to string in modern BeautifulSoup versions). One way is to pass a function that checks whether the substring is in the text:

for element in soup.find_all(text=lambda text: text and "technology" in text):
    print(element.get_text())

Or pass a compiled regular expression pattern:

import re

for element in soup.find_all(text=re.compile("technology")):
    print(element.get_text())
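
Here is a minimal, self-contained sketch of the same idea. It assumes a recent BeautifulSoup 4 release where the argument is spelled string. The sample HTML and the helper name tags_containing are made up for illustration. Note that a text search returns the matching text nodes, so the sketch uses .parent to get to the tag that encloses each match:

```python
import re

from bs4 import BeautifulSoup

# Hypothetical sample page, just for illustration.
html = """
<html><body>
  <h1>Technology news</h1>
  <p>New technology is everywhere.</p>
  <p>Nothing relevant here.</p>
  <div><span>Some technology inside a span</span></div>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")


def tags_containing(soup, word):
    """Return the tags that directly enclose a text node containing `word`."""
    # Escape the word so it is matched literally; ignore case.
    pattern = re.compile(re.escape(word), re.IGNORECASE)
    # find_all(string=...) yields text nodes; .parent is the enclosing tag.
    return [text_node.parent for text_node in soup.find_all(string=pattern)]


results = [
    (tag.name, tag.get_text(strip=True))
    for tag in tags_containing(soup, "technology")
]

for name, text in results:
    print(name, text)
```

The search here is case-insensitive, so "Technology" in the heading matches too. Drop re.IGNORECASE if you only want exact-case matches.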

Upvotes: 2

gman

Reputation: 575

Since you are looking for data inside an HTML structure rather than a typical data structure, you will nearly have to write an HTML parser for this job. Python doesn't normally know that "some string here" relates to another string wrapped in angle brackets somewhere else.

There may be a library for this, but I have a feeling that there isn't :(

Upvotes: 0
