Je Je

Reputation: 572

Find a tag using text it contains using BeautifulSoup

I am trying to scrape parts of this page: https://markets.businessinsider.com/stocks/bp-stock using BeautifulSoup, by searching for text contained in the h2 titles of the tables.

When I do:

data_table = soup.find('h2', text=re.compile('RELATED STOCKS')).find_parent('div').find('table')

it correctly gets the table I am after.

When I try to get the "Analyst Opinion" table using a similar line, it returns None:

data_table = soup.find('h2', text=re.compile('ANALYST OPINIONS')).find_parent('div').find('table')

I am guessing that there might be some special characters in the HTML that prevent re from working as expected. I tried this too:

data_table = soup.find('h2', text=re.compile('.*?STOCK.*?INFORMATION.*?', re.DOTALL))

without success.

I would like to get the table that contains the text "Analyst Opinion" without finding all tables, just by checking whether it contains my requested text.

Any idea will be highly appreciated. Best

Upvotes: 0

Views: 157

Answers (1)

Andrej Kesely

Reputation: 195573

You can use a CSS selector to locate the <table>:

import requests
from bs4 import BeautifulSoup

url = 'https://markets.businessinsider.com/stocks/bp-stock'

soup = BeautifulSoup(requests.get(url).text, 'lxml')

table = soup.select_one('div:has(> h2:contains("Analyst Opinions")) table')

for tr in table.select('tr'):
    print(tr.get_text(strip=True, separator=' '))

Prints:

2/26/2018 BP Outperform RBC Capital Markets
9/22/2017 BP Outperform BMO Capital Markets

More about CSS selectors here.
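
For anyone who wants to try the selector without hitting the live site, here is a minimal offline sketch. The SAMPLE_HTML markup is made up for illustration (the real page's structure may differ), and it uses html.parser so lxml is not needed. It also uses the :-soup-contains() spelling, which newer soupsieve releases (2.1 and later) prefer over the deprecated :contains(); that version requirement is an assumption about your environment.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real page; the class names and
# cell values are invented for this sketch.
SAMPLE_HTML = """
<div class="box">
  <h2>Related Stocks</h2>
  <table><tr><td>Shell</td><td>1.2%</td></tr></table>
</div>
<div class="box">
  <h2>BP <span>Analyst Opinions</span></h2>
  <table><tr><td>2/26/2018</td><td>Outperform</td></tr></table>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Pick the <div> whose direct child <h2> contains the text (including text
# inside nested tags), then take the <table> inside that <div>.
table = soup.select_one('div:has(> h2:-soup-contains("Analyst Opinions")) table')

rows = [tr.get_text(strip=True, separator=" ") for tr in table.select("tr")]
print(rows)
```

Note that the text match in this selector is case-sensitive, so the heading has to be spelled the way it appears in the HTML source, which is not necessarily the way the browser renders it.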


EDIT: For a case-insensitive match, you can use the bs4 API with regular expressions (note the flags=re.I). This is the equivalent of the .select_one() call above:

import re
import requests
from bs4 import BeautifulSoup

url = 'https://markets.businessinsider.com/stocks/bp-stock'

soup = BeautifulSoup(requests.get(url).text, 'lxml')

h2 = soup.find(lambda t: t.name=='h2' and re.findall('analyst opinions', t.text, flags=re.I))
table = h2.find_parent('div').find('table')

for tr in table.select('tr'):
    print(tr.get_text(strip=True, separator=' '))
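
As for why the original find('h2', text=...) call can return None: one plausible explanation (an assumption, since the page's markup is not shown here) is that the heading contains nested tags or is written in a different case than it is displayed. BeautifulSoup tests the text=/string= argument against tag.string, and that is None as soon as the tag has more than one child. The sketch below reproduces this on made-up HTML; SAMPLE_HTML and heading_matches are hypothetical names used only for illustration.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup: the second heading is split across a nested <span>.
SAMPLE_HTML = """
<div>
  <h2>Related Stocks</h2>
  <table><tr><td>Shell</td></tr></table>
</div>
<div>
  <h2><span>Analyst</span> Opinions</h2>
  <table><tr><td>Outperform</td></tr></table>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
pattern = re.compile(r"analyst\s+opinions", re.I)

# string= is compared against tag.string, which is None for an <h2> that has
# several children, so this lookup finds nothing.
by_string = soup.find("h2", string=pattern)

def heading_matches(tag):
    # get_text() joins the text of all descendants, so nested tags are fine.
    return tag.name == "h2" and pattern.search(tag.get_text()) is not None

by_text = soup.find(heading_matches)
table = by_text.find_parent("div").find("table")
cells = [td.get_text(strip=True) for td in table.select("td")]
print(by_string, cells)
```

If the function-based lookup finds the heading while the string= lookup does not, nested markup inside the h2 is the likely culprit.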

Upvotes: 1
