Reputation: 1547
I am currently using BeautifulSoup 4 to extract 'a href' tags from an HTML page. I am using the find_all query, and it works fine, returning the 'a href' tags I'm looking for. An example of what is returned is below:
"<a href="manage/foldercontent.html?folder=Pictures" style="background-image: url(shares/Pictures/DefaultPicture.png)" target="content_window" title="Vaya al recurso compartido Pictures">Pictures</a>"
What I'd like to do now, however, is extract just the opening part, <a href="manage/foldercontent.html?folder=Pictures", rather than the full tag returned above.
My code is below:
import urllib2
from bs4 import BeautifulSoup

req = urllib2.Request(example_url)
response = urllib2.urlopen(req)
soup = BeautifulSoup(response.read(), from_encoding=response.info().getparam('charset'))
for link in soup.find_all('a', href=True):
    # Keep only the relevant 'a href' tags
    if "foldercontent.html?folder" in link['href']:
        print link
Is this possible by modifying what I search for, or would I have to run a regex over the returned string?
Upvotes: 0
Views: 645
Reputation: 1121266
You can use CSS selectors:
for link in soup.select('a[href*="foldercontent.html?folder"]'):
    print(link['href'])  # just the URL, not the whole tag
The [<attribute>*="<substring>"] syntax matches any attribute value that contains the substring.
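As a minimal sketch of that selector in action (written for Python 3, with made-up sample markup modeled on the tag in the question and a second, unrelated link added purely for contrast), something like the following should pick out only the folder links:

```python
from bs4 import BeautifulSoup

# Sample markup (an assumption for illustration): one folder link like the
# one in the question, plus an unrelated link that should not match.
html = '''
<a href="manage/foldercontent.html?folder=Pictures" target="content_window">Pictures</a>
<a href="manage/settings.html">Settings</a>
'''

# html.parser is the parser bundled with the standard library.
soup = BeautifulSoup(html, "html.parser")

# [href*="..."] keeps only anchors whose href contains the substring.
matches = soup.select('a[href*="foldercontent.html?folder"]')

# Indexing a matched tag by attribute name gives the attribute value as a string.
hrefs = [link["href"] for link in matches]
print(hrefs)
```

The filter that the question does with an 'if' inside the loop moves into the selector itself, so the loop body only has to deal with links that already match.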
Note that you get Tag objects returned, not strings; link['href'] gives you the URL itself. If you need to pull specific information out of the matched URL, you could parse the link['href'] value with the urlparse module (urllib.parse in Python 3) to get just the URL path, or just the query string, or parse the query string into its constituent parts.
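A rough sketch of that parsing step, assuming Python 3 (where the urlparse module lives under urllib.parse) and using the href value from the question as the input string:

```python
from urllib.parse import urlparse, parse_qs

# The href value taken from the tag shown in the question.
href = "manage/foldercontent.html?folder=Pictures"

parts = urlparse(href)

path = parts.path    # the URL path, without the query string
query = parts.query  # the raw query string

# parse_qs maps each query parameter name to a list of its values.
params = parse_qs(query)
folder = params["folder"][0]

print(path, query, folder)
```

On Python 2, the same two functions can be imported from the urlparse module instead.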
Upvotes: 4