DrBug

Reputation: 2024

Which BeautifulSoup findAll regex string should I use?

I have links in HTML of the form

<a href="/downloadsServlet?docid=abc" target="_blank">Report 1</a>
<a href="/downloadsServlet?docid=ixyz" target="_blank">Fetch Report 2 </a>

I can get a list of all links of this form using BeautifulSoup.

My code is as follows:

import re
import urllib2
from bs4 import BeautifulSoup
html_page = urllib2.urlopen(url)
soup = BeautifulSoup(html_page)
listOfLinks = list(soup.findall('a'))

However, I want to find the links which have the word "Fetch" in the text referencing the link.

I tried:

soup.findAll('a', re.compile(".*Fetch.*"))

But that does not work. How do I select only the a tags that have an href and whose text contains the word "Fetch"?

Upvotes: 2

Views: 9165

Answers (2)

宏杰李

Reputation: 12168

import re
soup.findAll('a', text = re.compile("Fetch"))

You can use a regex as a filter; BeautifulSoup applies it with re.search to decide which tags to keep.

text (or string) refers to the text value of the tag, so text = re.compile("Fetch") finds the tags whose text value contains 'Fetch'.

See the BeautifulSoup documentation for the text/string argument.
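
As a self-contained sketch of the filter above, the snippet below parses the two sample links from the question and keeps only the one whose text contains "Fetch". The names html, fetch_links and hrefs are illustrative, href=True is added on the assumption that only anchors with an href are wanted, and string= is the newer spelling of text= (bs4 4.4.0 and later).

```python
import re
from bs4 import BeautifulSoup

# Sample markup copied from the question.
html = '''
<a href="/downloadsServlet?docid=abc" target="_blank">Report 1</a>
<a href="/downloadsServlet?docid=ixyz" target="_blank">Fetch Report 2 </a>
'''

soup = BeautifulSoup(html, "html.parser")

# Keep <a> tags that have an href and whose text matches the regex.
# The compiled pattern is applied with search(), so no ".*" padding is needed.
fetch_links = soup.find_all("a", href=True, string=re.compile("Fetch"))

hrefs = [a["href"] for a in fetch_links]
print(hrefs)
```

On older bs4 releases the same call should work with text= in place of string=.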

One more thing: use find_all() or findAll(). findall() is not a method in bs4.
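
To illustrate why the misspelling fails, here is a small sketch. It relies on bs4 treating an unknown attribute name as a tag lookup, so soup.findall behaves like soup.find("findall"). The variable names are illustrative only.

```python
from bs4 import BeautifulSoup

html = '<a href="/downloadsServlet?docid=abc" target="_blank">Report 1</a>'
soup = BeautifulSoup(html, "html.parser")

# Unknown attribute names are treated as a search for a tag of that name.
# There is no <findall> tag in this document, so the lookup yields None.
misspelled = soup.findall

try:
    soup.findall("a")  # this calls None, not a search method
    error_name = None
except TypeError as exc:
    error_name = type(exc).__name__

# The correctly spelled method returns the list of matching tags.
links = soup.find_all("a")
print(misspelled, error_name, len(links))
```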

Upvotes: 6

DYZ

Reputation: 57033

A regex may be overkill here, but it leaves room for extending the match later:

import re

def criterion(tag):
    # Keep tags that have an href and whose text contains "Fetch".
    return tag.has_attr('href') and re.search('Fetch', tag.text)

soup.findAll(criterion)
# [<a href="/downloadsServlet?docid=ixyz" target="_blank">Fetch Report 2 </a>]
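
A runnable variation on the same idea, as a sketch rather than a definitive version: the function below also checks the tag name and uses a word-boundary regex. The name is_fetch_link and the last two sample links (a nested <b> tag and an anchor without href) are hypothetical additions for illustration, not part of the question.

```python
import re
from bs4 import BeautifulSoup

# First two links are from the question; the last two are made up
# to show what the extra checks do.
html = '''
<a href="/downloadsServlet?docid=abc" target="_blank">Report 1</a>
<a href="/downloadsServlet?docid=ixyz" target="_blank">Fetch Report 2 </a>
<a href="/downloadsServlet?docid=nested" target="_blank"><b>Fetch</b> Report 3</a>
<a name="top">Fetch without href</a>
'''

def is_fetch_link(tag):
    """Return True for <a> tags with an href whose full text contains 'Fetch'."""
    return (
        tag.name == "a"
        and tag.has_attr("href")
        and re.search(r"\bFetch\b", tag.get_text()) is not None
    )

soup = BeautifulSoup(html, "html.parser")
matches = soup.find_all(is_fetch_link)
hrefs = [a["href"] for a in matches]
print(hrefs)
```

One reason to prefer a function here: get_text() joins the text of all descendants, so a link with nested markup is still matched, whereas a text/string filter compares against tag.string, which is None when a tag has more than one child.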

Upvotes: 7
