thefragileomen

Reputation: 1547

Extracting string from a href tag with Python 2.7x

I am currently using BeautifulSoup4 to extract 'a href' tags from an HTML page. I am using its find_all method, which works fine and returns the 'a href' tags I'm looking for. An example of what is returned is below:

"<a href="manage/foldercontent.html?folder=Pictures" style="background-image: url(shares/Pictures/DefaultPicture.png)" target="content_window" title="Vaya al recurso compartido Pictures">Pictures</a>"

What I'd like to do now, however, is extract only "<a href="manage/foldercontent.html?folder=Pictures" rather than the full tag returned above.

My code is below:

import urllib2
from bs4 import BeautifulSoup

req = urllib2.Request(example_url)
response = urllib2.urlopen(req)
soup = BeautifulSoup(response.read(), from_encoding=response.info().getparam('charset'))
for link in soup.find_all('a', href=True):
    # The 'if' below keeps only the relevant 'a href' tags
    if "foldercontent.html?folder" in link['href']:
        print link

Can I do this by modifying what I search for, or would I have to run a regex across the returned string?

Upvotes: 0

Views: 645

Answers (1)

Martijn Pieters

Reputation: 1121266

You can use CSS selectors:

for link in soup.select('a[href*="foldercontent.html?folder"]'):
    print link['href']

The [<attribute>*="<substring>"] syntax matches any attribute value that contains the substring.

Note that you get Tag objects returned, not strings; if you need specific information from the matched URL, you could parse the link['href'] value with the urlparse module to get just the URL path, or just the query string, or parse the query string into its constituent parts.
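
As a minimal sketch of that last suggestion, the snippet below splits the href value from the question's example tag into its path, its query string, and the individual query parameters. The variable names are only illustrative, and the try/except import is an assumption added so the same code runs on Python 2.7 (urlparse module) and Python 3 (urllib.parse):

```python
try:
    from urllib.parse import urlparse, parse_qs  # Python 3
except ImportError:
    from urlparse import urlparse, parse_qs      # Python 2.7

# href value copied from the example tag in the question;
# in the real loop this would be link['href']
href = "manage/foldercontent.html?folder=Pictures"

parts = urlparse(href)
path = parts.path          # URL path without the query string
query = parts.query        # raw query string
params = parse_qs(query)   # query string parsed into a dict of lists
folder = params["folder"][0]

print(path)    # manage/foldercontent.html
print(query)   # folder=Pictures
print(folder)  # Pictures
```

parse_qs returns a list for each key because a query string may repeat a parameter, hence the [0] when only one value is expected.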

Upvotes: 4
