Bob Hopez

Reputation: 783

Extracting 'a' tags containing specific substring with Python's BeautifulSoup

Using BeautifulSoup, I would like to return only "a" tags containing "Company" and not "Sector" in their href string. Is there a way to use regex inside of re.compile() to return only Companies and not Sectors?

Code:

soup = soup.findAll('tr')[5].findAll('a')
print(soup)

Output

[<a class="example" href="../ref/index.htm">Example</a>,  
<a href="?Company=FB">Facebook</a>,  
<a href="?Company=XOM">Exxon</a>,  
<a href="?Sector=5">Technology</a>,  
<a href="?Sector=3">Oil & Gas</a>]  

Using this method:

import re
soup.findAll('a', re.compile("Company"))

Returns:

AttributeError: 'ResultSet' object has no attribute 'findAll'

But I would like it to return (without the Sectors):

[<a href="?Company=FB">Facebook</a>,
<a href="?Company=XOM">Exxon</a>]

Using:

Upvotes: 2

Views: 4502

Answers (4)

Matt O

Reputation: 1346

Another approach is XPath, which supports and/not operations for querying by attributes in an XML or HTML document. Unfortunately, BeautifulSoup doesn't support XPath itself, but lxml does:

from lxml.html import fromstring
import requests

r = requests.get("YourUrl")
tree = fromstring(r.text)
# Get <a> elements whose href contains "?Company", excluding any whose href contains "Sector"
a_tags = tree.xpath("//a[contains(@href,'?Company') and not(contains(@href, 'Sector'))]")
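
To try the same XPath without fetching a page, here is a minimal sketch that runs it against inline markup. The html string is a hypothetical sample modelled on the output in the question, not the real page, and it assumes lxml is installed.

```python
from lxml.html import fromstring

# Hypothetical sample markup based on the links shown in the question.
html = """
<html><body><table><tr><td>
<a class="example" href="../ref/index.htm">Example</a>
<a href="?Company=FB">Facebook</a>
<a href="?Company=XOM">Exxon</a>
<a href="?Sector=5">Technology</a>
<a href="?Sector=3">Oil &amp; Gas</a>
</td></tr></table></body></html>
"""

tree = fromstring(html)

# Keep links whose href contains "?Company" and does not contain "Sector".
a_tags = tree.xpath(
    "//a[contains(@href, '?Company') and not(contains(@href, 'Sector'))]"
)

# text_content() returns the visible text of each matched element.
names = [a.text_content() for a in a_tags]
hrefs = [a.get("href") for a in a_tags]
print(names)
print(hrefs)
```

Note that the not(contains(...)) clause would also drop a link whose href mentions both Company and Sector, which may or may not be what you want.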

Upvotes: 2

wyattis

Reputation: 1297

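Assigning soup = soup.findAll('tr')[5].findAll('a') overwrites the original soup variable with a ResultSet. A ResultSet is essentially a list of Tag objects and has no findAll method of its own, which is why the later soup.findAll('a', re.compile("Company")) raises the AttributeError. The regex also needs to be passed as href=..., because a bare second argument to findAll is matched against the class attribute rather than href. Keep soup as the original BeautifulSoup object and try the following to get all of the "Company" links instead.

import re
links = soup.findAll('tr')[5].findAll('a', href=re.compile("Company"))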

To get the text contained in these tags:

companies = [link.text for link in links]
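
For anyone who wants to run this end to end, here is a minimal sketch against inline markup. The html string is a hypothetical sample built from the output in the question, so the row index is 0 here instead of 5, and find_all is simply the newer spelling of findAll.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample markup based on the links shown in the question.
html = """
<table><tr><td>
<a class="example" href="../ref/index.htm">Example</a>
<a href="?Company=FB">Facebook</a>
<a href="?Company=XOM">Exxon</a>
<a href="?Sector=5">Technology</a>
<a href="?Sector=3">Oil &amp; Gas</a>
</td></tr></table>
"""

soup = BeautifulSoup(html, "html.parser")

# Use a new name for the row so the soup object is not overwritten.
row = soup.find_all("tr")[0]  # the real page in the question uses index 5

# href=re.compile(...) keeps only tags whose href matches the pattern.
links = row.find_all("a", href=re.compile(r"Company"))
companies = [link.get_text() for link in links]
hrefs = [link["href"] for link in links]

print(links)
print(companies)
```

The Sector links are left out simply because their href never matches "Company", so no explicit exclusion is needed for this markup.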

Upvotes: 3

Bob Hopez

Reputation: 783

Thanks for the answers above, @Padraic Cunningham and @Wyatt I! This is a less elegant solution I came up with:

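import re
# soup is the ResultSet of <a> tags from the question; iterate over every tag, including the first
for link in soup:
    # Test only the href so that link text containing "Company" cannot match by accident
    if re.search("Company", link.get("href", "")):
        print(link)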

Upvotes: 1

Padraic Cunningham

Reputation: 180411

You can use a CSS selector to get all the a tags whose href starts with ?Company. Quote the attribute value, since ? is not valid in an unquoted CSS attribute value:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")

a = soup.select('a[href^="?Company"]')

If you want them just from the sixth tr you can use nth-of-type:

a = soup.select('tr:nth-of-type(6) a[href^="?Company"]')
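
Here is a small runnable sketch of both selectors. The html string is a hypothetical two-row table, so it uses nth-of-type(2) rather than 6, and the attribute value is quoted because recent BeautifulSoup releases, which hand selectors to soupsieve, are likely to reject an unquoted value that starts with ?.

```python
from bs4 import BeautifulSoup

# Hypothetical two-row table; the real page in the question has more rows.
html = """
<table>
<tr><td><a href="?Company=AAPL">Apple</a></td></tr>
<tr><td>
<a href="?Company=FB">Facebook</a>
<a href="?Sector=5">Technology</a>
</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Every <a> in the document whose href starts with "?Company".
all_companies = [a.get_text() for a in soup.select('a[href^="?Company"]')]

# Only those inside a <tr> that is the second <tr> among its siblings.
second_row = [
    a.get_text() for a in soup.select('tr:nth-of-type(2) a[href^="?Company"]')
]

print(all_companies)
print(second_row)
```

Keep in mind that nth-of-type counts rows among siblings, so on a page with several tables it can match one row per table, whereas findAll('tr')[5] picks the sixth tr in the whole document.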

Upvotes: 1
