Reputation: 3
I'm using "requests" and "beautifulsoup" to find all the href links on a webpage whose link text matches a specific string. It works, except that when the link text starts on a new line, beautifulsoup doesn't "see" it and doesn't return that link.
soup = BeautifulSoup(webpageAdress, "lxml")
path = soup.findAll('a', href=True, text="Something3")
print(path)
Example:
With this markup, it returns the href for the Something3 text:
...
<a href="page1/somethingC.aspx">Something3</a>
...
With this markup, it doesn't return the href for the Something3 text:
...
<a href="page1/somethingC.aspx">
Something3</a>
...
The only difference is that the link text (Something3) is on a new line. I can't change the HTML because I'm not the webmaster of that page.
Any idea how I can solve this?
Note: I've already tried soup.replace('\n', ' ').replace('\r', ''), but I get the error 'NoneType' object is not callable.
Upvotes: 0
Views: 1475
Reputation: 84455
You can use the :contains pseudo-class with bs4 4.7.1:
from bs4 import BeautifulSoup as bs
html = '<a href="page1/somethingC.aspx">Something3</a>'
soup = bs(html, 'lxml')
links = [link.text for link in soup.select('a:contains(Something3)')]
print(links)
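As a hedged follow-up sketch: the snippet above collects the link text, while the question asks for the href. The version below reads the href attribute and uses the multi-line markup from the question. It assumes a soupsieve release (2.1 or later) in which the selector is spelled :-soup-contains, the newer name for :contains, and it uses the built-in "html.parser" only to avoid depending on lxml. The first anchor (somethingA.aspx, Something1) is made up for illustration.

```python
from bs4 import BeautifulSoup

# Sample markup: the second link has its text on a new line,
# as in the question. The first link is a made-up non-match.
html = '''<a href="page1/somethingA.aspx">Something1</a>
<a href="page1/somethingC.aspx">
Something3</a>'''

soup = BeautifulSoup(html, "html.parser")

# :-soup-contains does a substring test on the element's text,
# so the leading newline does not prevent a match.
hrefs = [a["href"] for a in soup.select('a[href]:-soup-contains("Something3")')]
print(hrefs)
```

Because this is a substring test, a link whose text is "Something30" would match as well.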
Upvotes: 1
Reputation: 24930
And a solution without regex:
path = soup.select('a')
if path[0].getText().strip() == 'Something3':
    print(path)
Output:
[<a href="page1/somethingC.aspx">
Something3</a>]
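The check above only looks at the first link on the page. A minimal sketch of the same idea applied to every link might look like this; hrefs_with_text is a hypothetical helper name, the page2 and page3 URLs are made up for the demo, and "html.parser" is used just to keep the example free of the lxml dependency.

```python
from bs4 import BeautifulSoup


def hrefs_with_text(html, wanted):
    """Return the href of every <a> whose stripped text equals `wanted`."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        a["href"]
        for a in soup.find_all("a", href=True)
        # strip() removes the leading newline before comparing
        if a.get_text().strip() == wanted
    ]


# Hypothetical sample: same-line text, new-line text, and a near miss.
sample = '''<a href="page1/somethingC.aspx">Something3</a>
<a href="page2/somethingC.aspx">
Something3</a>
<a href="page3/other.aspx">Something30</a>'''

result = hrefs_with_text(sample, "Something3")
print(result)
```

Comparing the stripped text for equality means that "Something30" is not picked up, unlike a substring or unanchored regex match.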
Upvotes: 0
Reputation: 28565
You can use a regex to find any text that contains `"Something3"`:
html = '''<a href="page1/somethingC.aspx">Something3</a>
<a href="page1/somethingC.aspx">
Something3</a>'''
from bs4 import BeautifulSoup
import re
soup = BeautifulSoup(html, "lxml")
path = soup.findAll('a', href=True, text=re.compile("Something3"))
for link in path:
    print(link['href'])
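One caveat with the pattern above: re.compile("Something3") matches anywhere in the text, so a link labelled "Something30" would be returned too. If an exact match is wanted, a possible variant is to anchor the pattern and allow surrounding whitespace. This is only a sketch: the somethingD.aspx link is invented for the demo, "html.parser" stands in for lxml, and string= is the newer name for the text= argument (bs4 4.4.0 and later).

```python
import re

from bs4 import BeautifulSoup

# The second link is a made-up near miss used to show the difference.
html = '''<a href="page1/somethingC.aspx">
Something3</a>
<a href="page1/somethingD.aspx">Something30</a>'''

soup = BeautifulSoup(html, "html.parser")

# ^\s* and \s*$ tolerate leading/trailing whitespace (including newlines)
# but reject any other extra characters around the wanted text.
pattern = re.compile(r"^\s*Something3\s*$")
matches = [a["href"] for a in soup.find_all("a", href=True, string=pattern)]
print(matches)
```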
Upvotes: 1