Atilla W.

Reputation: 1

Python: Find specific link within HTML <a> tag

In Python I have a string containing the source code of a website. Within this source code I want to get the link from an <a> tag, if the tag contains a specific substring.

The input looks like this, for example:

AnyKindOfString <a href="http://www.link-to-get.com">SearchString</a> AndEvenMoreString

So what I want to tell Python is to search for SearchString in all the <a> tags within the string and give me back the first link it finds, here http://www.link-to-get.com.

This should only match if SearchString is inside the <a> tag, and it should also match if "SearchString" is a substring of the link itself (http://www.link-to-get.com).

I have been searching for an answer for more than 30 minutes now, and the only thing I found for Python was how to extract every link (or only the external or only the internal ones) from a string.

Does anyone have an idea?

Thanks in advance!

Upvotes: 0

Views: 299

Answers (3)

Michael Moura

Reputation: 229

Using BeautifulSoup 3.2.1 with Python 2.7:

from BeautifulSoup import BeautifulSoup

search_string = 'SearchString'

website_source = '<a href="http://www.link-to-get.com">SearchString</a> <a href="http://www.link-to-get.com">OtherString</a>\
                  <a href="http://www.link-to-getSearchString.com">otherString</a>'

soup = BeautifulSoup(website_source)

# this returns a list of [url, link text] lists, one for every matching anchor
anchors = [[row['href'], row.text] for row in soup.findAll('a') if row['href'].find(search_string) != -1 or search_string in row.text]

# prints whole list
print anchors

# prints first list
print anchors[0]

# prints the url for the first list
print anchors[0][0]

The issue seems to be that I tested the above with BeautifulSoup 3.2.1, which only works with Python 2.x, whereas you are using Python 3.4, hence the error.
If you install BeautifulSoup4, which works with both Python 2.x and 3.x, the code below should work.

Please note that the below has not been tested.

from bs4 import BeautifulSoup

search_string = 'SearchString'

website_source = '<a href="http://www.link-to-get.com">SearchString</a> <a href="http://www.link-to-get.com">OtherString</a>\
                  <a href="http://www.link-to-getSearchString.com">otherString</a>'

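# name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(website_source, 'html.parser')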

# this returns a list of [url, link text] lists, one for every matching anchor
# row.get('href', '') avoids a KeyError for anchors that have no href attribute
anchors = [[row.get('href', ''), row.text] for row in soup.find_all('a') if search_string in row.get('href', '') or search_string in row.text]

# prints whole list
print(anchors)

# prints first list
print(anchors[0])

# prints the url for the first list
print(anchors[0][0])
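
If installing a third-party package is not an option, roughly the same idea can be sketched with the standard library's html.parser module. This is only a minimal sketch for Python 3, not part of the tested code above: the names AnchorCollector and first_matching_link are made up for illustration, and it assumes anchors are not nested inside each other.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collect (href, text) pairs for every <a> element in the input."""

    def __init__(self):
        super().__init__()
        self.anchors = []        # finished (href, text) pairs
        self._in_anchor = False  # True while an <a> element is open
        self._href = ""          # href of the currently open <a>
        self._text = []          # text fragments seen inside the open <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_anchor = True
            # attrs is a list of (name, value) pairs; value may be None
            self._href = dict(attrs).get("href") or ""
            self._text = []

    def handle_data(self, data):
        if self._in_anchor:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_anchor:
            self.anchors.append((self._href, "".join(self._text)))
            self._in_anchor = False


def first_matching_link(html, search_string):
    """Return the first href whose URL or link text contains search_string."""
    parser = AnchorCollector()
    parser.feed(html)
    parser.close()
    for href, text in parser.anchors:
        if search_string in href or search_string in text:
            return href
    return None  # no anchor matched


source = ('AnyKindOfString <a href="http://www.link-to-get.com">SearchString</a> '
          'AndEvenMoreString')
print(first_matching_link(source, "SearchString"))
```

Unlike the list comprehension above, this returns only the first matching URL, which is what the question asks for, and None when nothing matches.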

Upvotes: 1

Shaikhul

Reputation: 722

This can be done with the help of pyquery (http://pythonhosted.org/pyquery/index.html) and lxml (http://lxml.de/tutorial.html) as follows:

from pyquery import PyQuery as pq
from lxml import etree

pq_obj = pq(etree.fromstring('<body><p>AnyKindOfString <a href="http://www.link-to-get.com">SearchString</a> AndEvenMoreString</p><p>this is another string goes here</p><a> other</a></body>'))
search_string = 'SearchString'

links = pq_obj('a')
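for link in links:
    # match on the link text or on the URL itself, as the question asks
    href = link.attrib.get('href', '')
    if search_string in (link.text or '') or search_string in href:
        print(href)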

# output
# http://www.link-to-get.com

Upvotes: 0

russOnXMaps

Reputation: 159

I've roughed out some code that should work; at least it works on the example string you gave.

myString = 'AnyKindOfString <a href="http://www.link-to-get.com">SearchString</a> AndEvenMoreString'

searchStringLinkPairs = []
tempStr = myString

while True:
    startLoc = tempStr.find('<a')
    endLoc = tempStr.find('</a', startLoc)
    if startLoc == -1 or endLoc == -1:
        break  # no complete <a>...</a> pair left
    subStr = tempStr[startLoc:endLoc]

    # the link is whatever sits between the first pair of double quotes
    startLink = subStr.find('"')
    endLink = subStr.find('"', startLink + 1)
    myLink = subStr[startLink + 1:endLink]

    # the link text is everything after the first '>'
    searchStringStart = subStr.find('>')
    searchString = subStr[searchStringStart + 1:]

    if myLink != "" and searchString != "":
        searchStringLinkPairs.append([myLink, searchString])
    # continue scanning just past the '<' of the closing tag
    tempStr = tempStr[endLoc + 1:]

print(searchStringLinkPairs)
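
The same hand-rolled scan can probably be written more compactly with the standard re module. The following is only a sketch: ANCHOR_RE and find_first_link are names invented here, and the pattern assumes a double-quoted href attribute and well-formed <a>...</a> pairs. A regular expression is not a real HTML parser, so unusual markup may slip through.

```python
import re

# Group 1 captures the href value, group 2 the text between <a ...> and </a>.
# Assumption: href is written with double quotes.
ANCHOR_RE = re.compile(
    r'<a\b[^>]*?\bhref="([^"]*)"[^>]*>(.*?)</a>',
    re.IGNORECASE | re.DOTALL,
)


def find_first_link(source, search_string):
    """Return the first href whose URL or link text contains search_string."""
    for match in ANCHOR_RE.finditer(source):
        href, text = match.group(1), match.group(2)
        if search_string in href or search_string in text:
            return href
    return None  # nothing matched


myString = ('AnyKindOfString <a href="http://www.link-to-get.com">SearchString</a> '
            'AndEvenMoreString')
print(find_first_link(myString, 'SearchString'))
```

This version also applies the filter from the question, so it returns a single URL instead of a list of every link and text pair.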

Upvotes: 0
