Reputation: 783
Using BeautifulSoup, I would like to return only "a" tags containing "Company" and not "Sector" in their href string. Is there a way to use regex inside of re.compile() to return only Companies and not Sectors?
Code:
soup = soup.findAll('tr')[5].findAll('a')
print(soup)
Output
[<a class="example" href="../ref/index.htm">Example</a>,
<a href="?Company=FB">Facebook</a>,
<a href="?Company=XOM">Exxon</a>,
<a href="?Sector=5">Technology</a>,
<a href="?Sector=3">Oil & Gas</a>]
Using this method:
import re
soup.findAll('a', re.compile("Company"))
Returns:
AttributeError: 'ResultSet' object has no attribute 'findAll'
But I would like it to return (without the Sectors):
[<a href="?Company=FB">Facebook</a>,
<a href="?Company=XOM">Exxon</a>]
Using:
Upvotes: 2
Views: 4502
Reputation: 1346
Another approach is XPath, which supports and/not operations when querying by attribute in an XML or HTML document. Unfortunately, BeautifulSoup doesn't support XPath itself, but lxml does:
from lxml.html import fromstring
import requests
r = requests.get("YourUrl")
tree = fromstring(r.text)
# Get <a> elements whose href contains "?Company", excluding any whose href contains "Sector"
a_tags = tree.xpath("//a[contains(@href,'?Company') and not(contains(@href, 'Sector'))]")
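To try the same XPath expression without fetching a page, here is a minimal sketch that runs it against inline markup. The HTML string is an assumption modelled on the output shown in the question, not the real page, and the wrapping div is only there so the fragment parses as one element.

```python
from lxml.html import fromstring

# Sample markup modelled on the question's output (an assumption, not the real page).
html = """
<div>
<a class="example" href="../ref/index.htm">Example</a>
<a href="?Company=FB">Facebook</a>
<a href="?Company=XOM">Exxon</a>
<a href="?Sector=5">Technology</a>
<a href="?Sector=3">Oil &amp; Gas</a>
</div>
"""

tree = fromstring(html)

# Keep <a> elements whose href contains "Company" and does not contain "Sector".
a_tags = tree.xpath("//a[contains(@href, 'Company') and not(contains(@href, 'Sector'))]")

names = [a.text_content() for a in a_tags]
hrefs = [a.get("href") for a in a_tags]
print(names)
print(hrefs)
```

Each result is an lxml element, so text_content() and get("href") are used here in place of BeautifulSoup's .text and ["href"].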
Upvotes: 2
Reputation: 1297
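The line soup = soup.findAll('tr')[5].findAll('a') overwrites the original soup variable, so the later call soup.findAll('a', re.compile("Company")) runs on a ResultSet rather than on the parsed document. findAll returns a ResultSet, which is essentially a list of Tag objects and has no findAll method of its own, hence the AttributeError. Note also that a pattern passed as the second positional argument is matched against the tag's CSS class, not its href, so it needs to be passed as href=. Try the following on the original, unmodified soup to get all of the "Company" links instead.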
links = soup.findAll('tr')[5].findAll('a', href=re.compile("Company"))
To get the text contained in these tags:
companies = [link.text for link in links]
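Since the question asks whether the regex itself can exclude Sectors, here is a minimal sketch that adds a negative lookahead to the pattern passed as href=. The HTML string is an assumption modelled on the question's output, and the pattern is one possible choice rather than the only way to write it.

```python
import re
from bs4 import BeautifulSoup

# Sample markup modelled on the question's output (an assumption, not the real page).
html = """
<table><tr><td>
<a class="example" href="../ref/index.htm">Example</a>
<a href="?Company=FB">Facebook</a>
<a href="?Company=XOM">Exxon</a>
<a href="?Sector=5">Technology</a>
<a href="?Sector=3">Oil &amp; Gas</a>
</td></tr></table>
"""

soup = BeautifulSoup(html, "html.parser")

# The negative lookahead rejects any href containing "Sector";
# the remainder of the pattern requires "Company" somewhere in the href.
pattern = re.compile(r"^(?!.*Sector).*Company")

links = soup.find_all("a", href=pattern)
companies = [link.text for link in links]
print(companies)
```

With this sample data a plain re.compile("Company") gives the same result, since no href contains both words; the lookahead only matters if a single href could mention both.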
Upvotes: 3
Reputation: 783
Thanks for the above answers @Padriac Cunningham and @Wyatt I !! This is a less elegant solution I came up with:
import re
# `soup` here is the ResultSet of <a> tags from the question
for link in soup:
    # Keep only the tags whose markup mentions "Company"
    if re.search("Company", str(link)):
        print(link)
Upvotes: 1
Reputation: 180411
You can use a CSS selector to get all the a tags whose href starts with ?Company:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
# The attribute value must be quoted because it starts with "?"
a = soup.select('a[href^="?Company"]')
If you want them just from the sixth tr you can use nth-of-type:
soup.select('tr:nth-of-type(6) a[href^="?Company"]')
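As a rough, self-contained sketch of both selectors, the snippet below builds a small table whose sixth row holds the links. The markup and the filler rows are assumptions for illustration, and the :not() variant is an addition that assumes a BeautifulSoup version whose select() is backed by soupsieve (4.7 or later).

```python
from bs4 import BeautifulSoup

# Link markup modelled on the question's output (an assumption, not the real page).
link_row = (
    '<tr><td>'
    '<a class="example" href="../ref/index.htm">Example</a>'
    '<a href="?Company=FB">Facebook</a>'
    '<a href="?Company=XOM">Exxon</a>'
    '<a href="?Sector=5">Technology</a>'
    '<a href="?Sector=3">Oil &amp; Gas</a>'
    '</td></tr>'
)
# Five filler rows so that the links sit in the sixth <tr>.
html = "<table>" + "<tr><td>filler</td></tr>" * 5 + link_row + "</table>"

soup = BeautifulSoup(html, "html.parser")

# href starts with "?Company" (the value is quoted because of the "?").
starts_with = soup.select('a[href^="?Company"]')

# Same selector, restricted to the sixth <tr> among its siblings.
sixth_row = soup.select('tr:nth-of-type(6) a[href^="?Company"]')

# href contains "Company" and does not contain "Sector".
not_sector = soup.select('a[href*="Company"]:not([href*="Sector"])')

print([a.text for a in starts_with])
print([a.text for a in sixth_row])
print([a.text for a in not_sector])
```

Note that tr:nth-of-type(6) counts rows among siblings of the same parent, so on a page with several tables it may not pick the same row as findAll('tr')[5], which counts rows across the whole document.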
Upvotes: 1