Reputation: 1
In Python I have a string containing the source code of a website. Within this source code I want to get the link within an <a> tag, if the tag contains a specific substring.
The input looks like this, for example:
AnyKindOfString <a href="http://www.link-to-get.com">SearchString</a> AndEvenMoreString
So what I want to tell Python is to search for SearchString in all the <a> tags within the string and give me back the first matching http://www.link-to-get.com.
This should only match if SearchString is within the tag, and it should also match if "SearchString" is part (a substring) of http://www.link-to-get.com.
I've been searching for an answer for more than 30 minutes now, and the only thing I found for Python was how to extract every link (or only the external or only the internal links) from a string.
Does anyone have an idea?
Thanks in advance!
Upvotes: 0
Views: 299
Reputation: 229
Using BeautifulSoup 3.2.1 with Python 2.7:
from BeautifulSoup import BeautifulSoup
search_string = 'SearchString'
website_source = '<a href="http://www.link-to-get.com">SearchString</a> <a href="http://www.link-to-get.com">OtherString</a>\
<a href="http://www.link-to-getSearchString.com">otherString</a>'
soup = BeautifulSoup(website_source)
# this will return a list of lists that has the url's and the name for the link
anchors = [[row['href'], row.text] for row in soup.findAll('a') if search_string in row['href'] or search_string in row.text]
# prints whole list
print anchors
# prints first list
print anchors[0]
# prints the url for the first list
print anchors[0][0]
The issue seems to be that I tested the above with BeautifulSoup 3.2.1, which only works in Python 2.x, and you are using Python 3.4, hence the error.
If you install BeautifulSoup4, which works in both 2.x and 3.x, and try the code below, it should work.
Please note that the code below has not been tested.
from bs4 import BeautifulSoup
search_string = 'SearchString'
website_source = '<a href="http://www.link-to-get.com">SearchString</a> <a href="http://www.link-to-get.com">OtherString</a>\
<a href="http://www.link-to-getSearchString.com">otherString</a>'
# name the parser explicitly so bs4 does not warn about guessing one
soup = BeautifulSoup(website_source, 'html.parser')
# this will return a list of lists that has the url's and the name for the link
anchors = [[row['href'], row.text] for row in soup.find_all('a') if search_string in row['href'] or search_string in row.text]
# prints whole list
print(anchors)
# prints first list
print(anchors[0])
# prints the url for the first list
print(anchors[0][0])
Upvotes: 1
Reputation: 722
This can be done with the help of pyquery (http://pythonhosted.org/pyquery/index.html) + lxml (http://lxml.de/tutorial.html) as follows:
from pyquery import PyQuery as pq
from lxml import etree
pq_obj = pq(etree.fromstring('<body><p>AnyKindOfString <a href="http://www.link-to-get.com">SearchString</a> AndEvenMoreString</p><p>this is another string goes here</p><a> other</a></body>'))
search_string = 'SearchString'
links = pq_obj('a')
for link in links:
    # link.text is None when an anchor has no leading text, so guard against it
    if link.text and search_string in link.text:
        attrib = link.attrib
        print(attrib.get('href'))
# output
# http://www.link-to-get.com
Upvotes: 0
Reputation: 159
I've roughed out some code that should work; at least it works on the example string you gave.
myString = 'AnyKindOfString <a href="http://www.link-to-get.com">SearchString</a> AndEvenMoreString'
theLimit = len(myString)
searchStringLinkPairs = []
tempStr = myString[:]
i = 0
while i < theLimit:
    # locate the next anchor: opening '<a' and closing '</a'
    startLoc = tempStr.find('<a')
    endLoc = tempStr.find("</a")
    print("%d\t%d" % (startLoc, endLoc))  # debug output
    subStr = tempStr[startLoc:endLoc]
    # the link is the text between the first pair of double quotes
    startLink = subStr.find("\"")
    subTwo = subStr[startLink+1:]
    endLink = subTwo.find("\"")
    myLink = subStr[startLink+1:startLink+1+endLink]
    # the link text is everything after the first '>'
    searchStringStart = subStr.find(">")
    searchString = subStr[searchStringStart+1:]
    if myLink != "" and searchString != "":
        searchStringLinkPairs.append([myLink, searchString])
    # continue scanning after this anchor
    tempStr = tempStr[endLoc+1:]
    i = endLoc
    # stop once no further anchor is found
    if startLoc == -1 or endLoc == -1:
        i = 10 * theLimit
print(searchStringLinkPairs)
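If you would rather not slice the string by hand, the same no-dependency idea can be sketched with the standard library's html.parser module in Python 3. This is only a rough sketch: the names AnchorCollector and first_matching_link are made up for illustration, and it assumes anchors are not nested and that a match on either the link text or the URL should count.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collect (href, text) pairs for every <a> element in the input."""

    def __init__(self):
        super().__init__()
        self.pairs = []          # finished (href, text) pairs, in document order
        self._href = ""          # href of the <a> that is currently open
        self._text = []          # text chunks seen inside the open <a>
        self._in_anchor = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_anchor = True
            # an <a> without href (or with an empty one) yields ""
            self._href = dict(attrs).get("href") or ""
            self._text = []

    def handle_data(self, data):
        if self._in_anchor:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_anchor:
            self.pairs.append((self._href, "".join(self._text)))
            self._in_anchor = False


def first_matching_link(html, search_string):
    """Return the first href whose link text or URL contains search_string, else None."""
    parser = AnchorCollector()
    parser.feed(html)
    parser.close()
    for href, text in parser.pairs:
        if href and (search_string in text or search_string in href):
            return href
    return None


my_string = ('AnyKindOfString <a href="http://www.link-to-get.com">'
             'SearchString</a> AndEvenMoreString')
print(first_matching_link(my_string, "SearchString"))
# prints: http://www.link-to-get.com
```

Text outside of any <a> tag is ignored, so a SearchString that only appears in the surrounding text does not produce a match.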
Upvotes: 0