eLudium

Reputation: 196

BeautifulSoup - Get Text within tag only if a certain string is found

I am trying to scrape some scripts from a TV-Show. I am able to get the text as I need it using BeautifulSoup and Requests.

import requests
from bs4 import BeautifulSoup

r = requests.get('http://www.example.com')
s = BeautifulSoup(r.text, 'html.parser')

for p in s.find_all('p'):
    print(p.text)

This works so far, but I only want the paragraphs spoken by a certain character. Say his name is "stackoverflow". The text would look like this:

A: sdasd sd asda B: sdasds STACKOVERFLOW: Help?

So I only want the stuff that STACKOVERFLOW says. Not the rest.

I have tried

s.find_all(text='STACKOVERFLOW') but I get nothing.

What would be the right way to do this? A hint in the right direction would be most appreciated.

Upvotes: 3

Views: 1604

Answers (2)

sytech

Reputation: 40861

You can make a custom function to pass into find_all. This function should take in one argument (tag) and return True for the tags that meet your criteria.

def so_tags(tag):
    '''returns True if the tag has text and 'STACKOVERFLOW' is in the text'''
    return bool(tag.text and "STACKOVERFLOW" in tag.text)

# soup is the BeautifulSoup object (called s in the question)
soup.find_all(so_tags)

You could also make a function factory to make it a bit more dynamic.

def user_paragraphs(user):
    '''returns a function'''
    def user_tags(tag):
        '''returns True for tags that have <user> in the text'''
        return (tag.text and user in tag.text)
    return user_tags

# user_list is assumed to be an iterable of speaker names, e.g. ['STACKOVERFLOW']
for user in user_list:
    user_posts = soup.find_all(user_paragraphs(user))
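
One thing to watch for with a filter like this: tag.text includes the text of all descendants, so ancestors such as <html> and <body> also contain the name and will match. A minimal sketch of the same factory restricted to <p> tags follows; the sample markup and the name speaker_paragraphs are made up for illustration, not taken from the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the scraped page.
html = """
<html><body>
<p>A: sdasd sd asda</p>
<p>B: sdasds</p>
<p>STACKOVERFLOW: Help?</p>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")


def speaker_paragraphs(speaker):
    """Return a filter that matches only <p> tags whose text mentions `speaker`."""
    def matches(tag):
        # The tag.name check keeps ancestors such as <body> out of the results;
        # their text also contains the speaker's name.
        return tag.name == "p" and speaker in tag.get_text()
    return matches


result = [p.get_text() for p in soup.find_all(speaker_paragraphs("STACKOVERFLOW"))]
print(result)
```

If the script puts each speaker in a separate paragraph, this should return just that speaker's paragraphs.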

Upvotes: 0

alecxe

Reputation: 473763

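Passing a plain string as text only matches strings that are exactly equal to 'STACKOVERFLOW', which is why you get nothing. Do a partial text match instead, either with: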

s.find_all(text=lambda text: text and 'STACKOVERFLOW' in text)

Or:

import re

s.find_all(text=re.compile('STACKOVERFLOW'))
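
If several speakers share one paragraph, as in the sample line from the question, a partial match still returns the whole string. Here is a rough sketch of pulling out only one speaker's words afterwards, using just the standard library. It assumes every speaker label is an uppercase word followed by a colon; the names SPEAKER_LABEL and lines_by are hypothetical helpers, and the pattern would need adjusting to the real script format.

```python
import re

# Assumption: a speaker label is an uppercase word followed by a colon, e.g. "STACKOVERFLOW:".
SPEAKER_LABEL = re.compile(r"\b([A-Z][A-Z0-9_]*):\s*")


def lines_by(speaker, text):
    """Return the chunks of `text` spoken by `speaker`."""
    # Splitting on a pattern with one capture group yields
    # [preamble, name1, speech1, name2, speech2, ...].
    parts = SPEAKER_LABEL.split(text)
    names = parts[1::2]
    speeches = parts[2::2]
    return [speech.strip() for name, speech in zip(names, speeches) if name == speaker]


sample = "A: sdasd sd asda B: sdasds STACKOVERFLOW: Help?"
print(lines_by("STACKOVERFLOW", sample))  # ['Help?']
```

You could feed it each string returned by find_all, or p.text for each paragraph.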

Upvotes: 2
