Reputation: 196
I am trying to scrape some scripts from a TV show. I am able to get the text I need using BeautifulSoup and Requests.
import requests
from bs4 import BeautifulSoup
r = requests.get('http://www.example.com')
s = BeautifulSoup(r.text, 'html.parser')
for p in s.find_all('p'):
    print(p.text)
This works great so far. But I want only those paragraphs from a certain character. Say his name is "stackoverflow". The text would be like this:
A: sdasd sd asda B: sdasds STACKOVERFLOW: Help?
So I only want the stuff that STACKOVERFLOW says. Not the rest.
I have tried
s.find_all(text='STACKOVERFLOW') but I get nothing.
What would be the right way to do this? A hint in the right direction would be most appreciated.
Upvotes: 3
Views: 1604
Reputation: 40861
You can pass a custom function to find_all. The function should take one argument (the tag) and return True for the tags that meet your criteria.
def so_tags(tag):
    '''Return True if the tag has text and "STACKOVERFLOW" is in that text.'''
    return (tag.text and "STACKOVERFLOW" in tag.text)

soup.find_all(so_tags)
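One thing to watch with a filter like this: tag.text includes the text of every descendant, so enclosing tags such as html and body match as well. Here is a minimal sketch that also checks the tag name. The sample markup and the name so_paragraphs are made up for illustration; your page will look different.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the scraped page.
html = """
<html><body>
<p>A: sdasd sd asda</p>
<p>B: sdasds</p>
<p>STACKOVERFLOW: Help?</p>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")


def so_paragraphs(tag):
    """Return True only for <p> tags whose text contains 'STACKOVERFLOW'."""
    return tag.name == "p" and "STACKOVERFLOW" in tag.get_text()


# Without the name check, <html> and <body> match too, because their
# text includes the text of the paragraph nested inside them.
unrestricted = soup.find_all(lambda tag: "STACKOVERFLOW" in tag.get_text())

matches = soup.find_all(so_paragraphs)
texts = [p.get_text() for p in matches]
print(len(unrestricted), texts)
```

If the character's lines live in some other element (a div or td, say), swap the tag name accordingly.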
You could also make a function factory to make it a bit more dynamic.
def user_paragraphs(user):
    '''Return a filter function for the given user.'''
    def user_tags(tag):
        '''Return True for tags that have <user> in the text.'''
        return (tag.text and user in tag.text)
    return user_tags

# user_list is assumed to be your own list of character names
for user in user_list:
    user_posts = soup.find_all(user_paragraphs(user))
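To make the factory idea concrete, here is a rough sketch that collects the lines per speaker. It assumes each line sits in its own p tag and starts with "NAME:", which may not hold for your page; speaker_filter and the sample markup are invented for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one paragraph per spoken line, prefixed "NAME:".
html = (
    "<p>A: first line</p>"
    "<p>B: reply</p>"
    "<p>STACKOVERFLOW: Help?</p>"
    "<p>A: mentions STACKOVERFLOW</p>"
)
soup = BeautifulSoup(html, "html.parser")


def speaker_filter(speaker):
    """Build a find_all filter for paragraphs spoken by `speaker`."""
    prefix = speaker + ":"

    def is_speaker_paragraph(tag):
        # Match on the prefix so that a mere mention of the name
        # inside someone else's line is not picked up.
        return tag.name == "p" and tag.get_text().strip().startswith(prefix)

    return is_speaker_paragraph


lines_by_speaker = {
    speaker: [p.get_text() for p in soup.find_all(speaker_filter(speaker))]
    for speaker in ["A", "STACKOVERFLOW"]
}
print(lines_by_speaker)
```

Matching on the prefix rather than a plain substring is a design choice: it keeps the last paragraph, where A only mentions the name, out of STACKOVERFLOW's results.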
Upvotes: 0
Reputation: 473763
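Passing a plain string as text makes find_all look for strings that are exactly equal to it, which is why you get nothing. Match part of the text instead, either with a function: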
s.find_all(text=lambda text: text and 'STACKOVERFLOW' in text)
Or:
import re
s.find_all(text=re.compile('STACKOVERFLOW'))
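Note that searching by text gives you the matching strings rather than the tags; use .parent if you need the enclosing tag. If all speakers share one paragraph, as in the sample line from the question, you still have to cut out the character's part yourself. A rough sketch, assuming each speaker is introduced by an upper-case name followed by a colon (the markup and the pattern are my guesses at the layout):

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup: every speaker shares a single paragraph.
html = "<p>A: sdasd sd asda B: sdasds STACKOVERFLOW: Help?</p>"
soup = BeautifulSoup(html, "html.parser")

# string= is the newer name for the text= argument (bs4 4.4.0 and later).
hits = soup.find_all(string=re.compile("STACKOVERFLOW"))

# Each hit is a NavigableString; .parent is the tag that contains it.
parent_names = [hit.parent.name for hit in hits]

# Capture what follows "STACKOVERFLOW:" up to the next "NAME:" marker
# or the end of the string.
pattern = re.compile(r"STACKOVERFLOW:\s*(.*?)(?=\s+[A-Z]+:|$)")
spoken = [part for hit in hits for part in pattern.findall(str(hit))]
print(parent_names, spoken)
```

The pattern will misfire if a line itself contains an upper-case word followed by a colon, so adjust it to whatever your transcripts actually look like.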
Upvotes: 2