Extracting string from a href tag with Python 2.7x

Question

I am currently using BeautifulSoup4 to extract 'a href' tags from an HTML page. I am using BeautifulSoup4's find_all method, and it works fine, returning the 'a href' tags I'm looking for. An example of what is returned is below:

"Pictures"

What I'm looking to now do however is simply extract " as opposed to the full content returned as above.



My code is below:

import urllib2
from bs4 import BeautifulSoup

# example_url is the address of the page being scraped (defined elsewhere)
req = urllib2.Request(example_url)
response = urllib2.urlopen(req)
soup = BeautifulSoup(response.read(), from_encoding=response.info().getparam('charset'))
for link in soup.find_all('a', href=True):
    # The 'if' below keeps only the relevant 'a href' tags
    if "foldercontent.html?folder" in link['href']:
        print link


Is this possible by modifying what I search for, or would I have to run a regex across the returned string?

Martijn Pieters · Accepted Answer

You can use CSS selectors:

for link in soup.select('a[href*="foldercontent.html?folder"]'):
    print link['href']  # the attribute value, as a string

The [attr*="value"] syntax matches any element whose attr attribute value contains the given substring.
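As a rough, self-contained sketch of how that selector behaves: the HTML snippet below is made up for illustration (the question does not show the real page), and the names html, matches, hrefs and texts are hypothetical. It is written so it should run on both Python 2.7 and Python 3 with bs4 installed.

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration only; the real page is not shown in the question.
html = '''
<a href="foldercontent.html?folder=1">Pictures</a>
<a href="index.html">Home</a>
<a href="foldercontent.html?folder=2">Music</a>
'''

soup = BeautifulSoup(html, 'html.parser')

# Select only the <a> tags whose href contains the substring.
matches = soup.select('a[href*="foldercontent.html?folder"]')

# Each match is a Tag; pull plain strings out of it.
hrefs = [link['href'] for link in matches]      # attribute values
texts = [link.get_text() for link in matches]   # visible link text

print(hrefs)
print(texts)
```

Here link['href'] gives the attribute value and link.get_text() gives the text between the opening and closing tags, so no regex over the printed tag should be needed for either.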

Note that you get Tag objects back, not strings. If you need specific information from the matched URL, you could parse the link['href'] value with the urlparse module to get just the URL path or just the query string, or to split the query string into its constituent parts.
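A minimal sketch of that parsing step, assuming a made-up href value (the folder and view parameters are hypothetical, not taken from the question). The try/except import is only there so the same snippet runs on Python 2.7, where the module is called urlparse, and on Python 3, where it moved to urllib.parse.

```python
try:
    from urllib.parse import urlparse, parse_qs  # Python 3
except ImportError:
    from urlparse import urlparse, parse_qs  # Python 2.7

# Hypothetical href value; in the loop above this would be link['href'].
href = 'foldercontent.html?folder=123&view=list'

parts = urlparse(href)

path = parts.path          # the part before the '?'
query = parts.query        # the raw query string after the '?'
params = parse_qs(query)   # dict mapping each parameter name to a list of values

print(path)
print(query)
print(params['folder'][0])
```

parse_qs returns a list for every key because a parameter can legally appear more than once in a query string, hence the [0] when only a single value is expected.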
