SIM

Reputation: 22440

Unable to get the first link from a certain search

I've written a Python script that searches a webpage for a specific link. The problem is that it prints four links, whereas I only want the first link that matches the search criterion, no matter how many matching links there are.

Here is my effort so far:

import requests
from lxml.html import fromstring

main_url = "http://www.excel-easy.com/vba.html"

def search_item(url):
    response = requests.get(url)
    tree = fromstring(response.text)
    for item in tree.cssselect("a"):
        try:
            if "excel" in item.text.lower():
                url_link = item.attrib['href']
                print(url_link)
        except: pass    

search_item(main_url)

The result I'm getting:

http://www.excel-easy.com
http://www.excel-easy.com
http://www.excel-easy.com
http://www.excel-easy.com/introduction/formulas-functions.html

The result I'm after (only the first one):

http://www.excel-easy.com

I tried item[0].attrib['href'], but that is obviously not a valid expression. Any help would be appreciated.

Upvotes: 0

Views: 78

Answers (2)

Bill Bell

Reputation: 21663

You could use an XPath expression instead.

>>> import requests
>>> from lxml import html
>>> url = "http://www.excel-easy.com/vba.html"
>>> response = requests.get(url).content
>>> tree = html.fromstring(response)

Having parsed the HTML, get the list of hrefs for all of the links in the page and loop through them. Watch for one that contains 'excel' once converted to lowercase, then display that href and break out of the loop.

>>> for item in tree.xpath('.//a/@href'):
...     if 'excel' in item.lower():
...         item
...         break
...     
'http://www.excel-easy.com'
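
As a possible variation, the filtering can be pushed into the XPath expression itself so that only the first matching href comes back. This is a minimal sketch, not a definitive implementation: the SAMPLE markup and the helper name first_href_containing are made up for illustration, and it runs against an inline HTML string rather than the live page. XPath 1.0 has no lowercase function, so translate() is used for the case-insensitive comparison.

```python
from lxml import html

# Hypothetical sample markup standing in for the real page.
SAMPLE = """
<html><body>
  <a href="/about.html">About</a>
  <a href="http://www.Excel-Easy.com">Excel Easy</a>
  <a href="http://www.excel-easy.com/introduction/formulas-functions.html">Formulas</a>
  <a>No href here</a>
</body></html>
"""

UPPER = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
LOWER = "abcdefghijklmnopqrstuvwxyz"


def first_href_containing(tree, needle):
    """Return the first href that contains needle (case-insensitive), or None."""
    # translate() lowercases the href inside XPath; the trailing [1]
    # keeps only the first match in document order.
    expr = "(.//a[contains(translate(@href, $upper, $lower), $needle)]/@href)[1]"
    found = tree.xpath(expr, upper=UPPER, lower=LOWER, needle=needle.lower())
    return found[0] if found else None


tree = html.fromstring(SAMPLE)
print(first_href_containing(tree, "excel"))  # http://www.Excel-Easy.com
```

The href is returned with its original casing; only the comparison is lowercased.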

Upvotes: 1

Kyle

Reputation: 1066

I originally used a list comprehension, but it got too crammed with the filters, and I think a for loop is easier to read. I don't think you need the try/except block here either: the if condition is simply false when "href" isn't in the attributes.

import requests
from lxml.html import fromstring

main_url = "http://www.excel-easy.com/vba.html"

def search_item(url):

    response = requests.get(url)
    tree = fromstring(response.text)
    matched = []

    for element in tree.cssselect("a"):
        if "href" in element.attrib and "excel" in element.attrib['href'].lower():
            matched.append(element)

    if matched:
        return matched[0].attrib['href']
    else:
        return None

print(search_item(main_url))
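
If only the first match is ever needed, one alternative is to stop at the first hit instead of collecting every match. The sketch below uses next() with a generator expression and a default of None; the helper name first_match and the sample list are hypothetical. With lxml, the hrefs could come from something like tree.xpath('.//a/@href').

```python
def first_match(hrefs, needle):
    """Return the first href containing needle (case-insensitive), else None."""
    needle = needle.lower()
    # The generator is lazy, so iteration stops at the first matching href.
    return next((h for h in hrefs if needle in h.lower()), None)


# Hypothetical hrefs standing in for the links scraped from the page.
hrefs = [
    "/about.html",
    "http://www.excel-easy.com",
    "http://www.excel-easy.com/introduction/formulas-functions.html",
]
print(first_match(hrefs, "excel"))  # http://www.excel-easy.com
```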

Upvotes: 0
