Reputation: 22440
I've written a Python script that searches a webpage for a link matching a search term. The problem is that it prints four links, whereas I want only the first link that matches the criterion, no matter how many matching links the page contains.
Here is my attempt so far:
import requests
from lxml.html import fromstring

main_url = "http://www.excel-easy.com/vba.html"

def search_item(url):
    response = requests.get(url)
    tree = fromstring(response.text)
    for item in tree.cssselect("a"):
        try:
            if "excel" in item.text.lower():
                url_link = item.attrib['href']
                print(url_link)
        except: pass

search_item(main_url)
The result I'm getting:
http://www.excel-easy.com
http://www.excel-easy.com
http://www.excel-easy.com
http://www.excel-easy.com/introduction/formulas-functions.html
The result I'm after (only the first one):
http://www.excel-easy.com
I tried item[0].attrib['href'], but that is obviously not a valid expression. Any help would be appreciated.
Upvotes: 0
Views: 78
Reputation: 21663
You could use an XPath expression instead.
>>> import requests
>>> from lxml import html
>>> url = "http://www.excel-easy.com/vba.html"
>>> response = requests.get(url).content
>>> tree = html.fromstring(response)
Having parsed the HTML, get the list of href values for all the links in the page and loop through them. Look for the first one that contains 'excel' once converted to lowercase, display that href, and break out of the loop.
>>> for item in tree.xpath('.//a/@href'):
...     if 'excel' in item.lower():
...         item
...         break
...
'http://www.excel-easy.com'
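The same "stop at the first hit" idea can also be written without an explicit loop, using `next()` over a generator expression. This is only a sketch: the helper name `first_match` is hypothetical, and the `hrefs` list is a hand-written stand-in for what `tree.xpath('.//a/@href')` would return, so the snippet runs without a network request.

```python
def first_match(hrefs, needle):
    """Return the first href containing `needle` (case-insensitive), or None."""
    needle = needle.lower()
    # next() pulls only the first item from the generator; None is the default
    # when nothing matches.
    return next((h for h in hrefs if needle in h.lower()), None)


# Assumed sample data standing in for tree.xpath('.//a/@href').
hrefs = [
    "#top",
    "http://www.excel-easy.com",
    "http://www.excel-easy.com/introduction/formulas-functions.html",
]
print(first_match(hrefs, "excel"))    # http://www.excel-easy.com
print(first_match(hrefs, "missing"))  # None
```

Because the generator is lazy, the remaining hrefs are never examined once a match is found.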
Upvotes: 1
Reputation: 1066
I originally used a list comprehension, but with the filters it got too cramped, so I think a for loop is easier to read. I don't think you need the try/except block here either: if "href" isn't among the element's attributes, the if condition is simply false and the element is skipped.
import requests
from lxml.html import fromstring

main_url = "http://www.excel-easy.com/vba.html"

def search_item(url):
    response = requests.get(url)
    tree = fromstring(response.text)
    matched = []
    for element in tree.cssselect("a"):
        if "href" in element.attrib and "excel" in element.attrib['href'].lower():
            matched.append(element)
    if matched:
        return matched[0].attrib['href']
    else:
        return None

print(search_item(main_url))
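Since only the first match is needed, there is no real need to collect every match into a list first. As a rough illustration of recording the first hit and ignoring the rest, here is a sketch that swaps lxml for the standard library's `html.parser`, so it runs without third-party packages. The class name `FirstLinkFinder`, the helper `first_link`, and the sample HTML are all made up for this example; like the code above, it matches on the href rather than the link text.

```python
from html.parser import HTMLParser


class FirstLinkFinder(HTMLParser):
    """Record the href of the first <a> tag whose href contains `needle`."""

    def __init__(self, needle):
        super().__init__()
        self.needle = needle.lower()
        self.found = None

    def handle_starttag(self, tag, attrs):
        # Ignore everything once a match has been recorded.
        if self.found is not None or tag != "a":
            return
        # attrs is a list of (name, value) pairs; href may be absent.
        href = dict(attrs).get("href")
        if href and self.needle in href.lower():
            self.found = href


def first_link(html_text, needle):
    """Return the first matching href in `html_text`, or None."""
    finder = FirstLinkFinder(needle)
    finder.feed(html_text)
    return finder.found


# Assumed sample markup, not the real page.
sample = """
<a name="top">no href here</a>
<a href="http://www.example.com">other</a>
<a href="http://www.excel-easy.com">Excel Easy</a>
<a href="http://www.excel-easy.com/introduction/formulas-functions.html">Functions</a>
"""
print(first_link(sample, "excel"))  # http://www.excel-easy.com
```

With lxml, the equivalent change would be to return `element.attrib['href']` directly inside the loop instead of appending to `matched`.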
Upvotes: 0