Kopatych

Reputation: 27

Cannot get an empty string into a list from XPath

I'm a complete beginner, practicing web scraping on tasks from Upwork. The problem is this: the list of aircraft names is returned in full, but the price list is not, because one listing (a new aircraft) has no price text. The lists then need to be combined into a table, and that goes wrong when they don't have the same number of elements.

Please help me learn how to handle such cases so that an empty string appears in the final list when the price is missing. Thanks in advance for your answers.

import requests
import lxml.html


def parse_data(url):
    try:
        response = requests.get(url)
    except:
        return
    tree = lxml.html.document_fromstring(response.text)
    text_aicraft = tree.xpath('//*[contains(@id, "item_card")]/div/div[4]/div/h2/a/text()')
    price_aicraft = tree.xpath('//*[contains(@id, "item_card")]/div/div[4]/div/div[1]/text()')
    print(text_aicraft)
    print(len(text_aicraft))
    print(price_aicraft)
    print(len(price_aicraft))


def main():
    url = 'https://www.avbuyer.com/aircraft/private-jets/page-13'
    parse_data(url)


if __name__ == "__main__":
    main()

Upvotes: 1

Views: 105

Answers (2)

Andrej Kesely

Reputation: 195643

This script iterates over each listing item and, if an item has no price, uses "N/A" in its place:

import requests
import lxml.html


def parse_data(url):
    try:
        response = requests.get(url)
    except requests.RequestException:  # catch network errors only, not every exception
        return
    tree = lxml.html.document_fromstring(response.text)
    for item in tree.xpath('//*[contains(@class, "list-item-details")]'):
        title = item.xpath(".//h2/a/text()")[0]
        price = item.xpath('.//*[contains(@class, "price")]/text()')
        price = price[0] if price else "N/A"

        print("{:<40} {:<20}".format(title, price))

def main():
    url = "https://www.avbuyer.com/aircraft/private-jets/page-13"
    parse_data(url)


if __name__ == "__main__":
    main()

Prints:

Dassault Falcon 50EX                     Deal pending        
Cessna Citation M2                       Please call         
Embraer Phenom 300                       Please call         
Bombardier Learjet 40XR                  Please call         
Embraer Legacy 600                       Please call         
Cessna Citation Sovereign                Price: USD $6,500,000
Cessna Citation Ultra                    Please call         
Cessna Citation Ultra                    Please call         
Airbus ACJ318                            Make offer          
Gulfstream G550                          Please call         
Boeing 737 -500                          Please call         
Boeing BBJ                               Make offer          
Hawker 800XP                             Please call         
Boeing 737                               Price: USD $3,500,000
Bombardier Learjet 55                    Please email        
Bombardier Challenger 300                Make offer          
Airbus ACJ TwoTwenty                     N/A                 
Gulfstream G200                          Please call         
Bombardier Learjet 60XR                  Deal pending        
Cessna Citation Mustang                  Price: USD $1,200,000
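
The same per-item approach can produce the empty string the question asks for. The following is a minimal sketch, not the page's real markup: SAMPLE_HTML and extract_rows are hypothetical names made up for illustration, and the class names are assumed to match the ones used in the script above. It relies on the XPath string() function, which returns an empty string both when the price element is empty and when it is missing altogether.

```python
import lxml.html

# Hypothetical markup that mimics the listing page: the second card has an
# empty price element, the third has no price element at all.
SAMPLE_HTML = """
<html><body>
  <div class="list-item-details"><h2><a>Gulfstream G200 </a></h2><div class="price">Please call </div></div>
  <div class="list-item-details"><h2><a>Airbus ACJ TwoTwenty </a></h2><div class="price"></div></div>
  <div class="list-item-details"><h2><a>Hawker 800XP </a></h2></div>
</body></html>
"""


def extract_rows(html):
    """Return one (title, price) tuple per card; price is "" when absent."""
    tree = lxml.html.document_fromstring(html)
    rows = []
    for item in tree.xpath('//*[contains(@class, "list-item-details")]'):
        # string(...) yields "" for an empty node-set, so no IndexError
        # and no None can occur here.
        title = item.xpath("string(.//h2/a)").strip()
        price = item.xpath('string(.//*[contains(@class, "price")])').strip()
        rows.append((title, price))
    return rows


rows = extract_rows(SAMPLE_HTML)
for title, price in rows:
    print("{:<40} {:<20}".format(title, price))
```

Because every card contributes exactly one tuple, the titles and prices cannot drift out of alignment, which is what went wrong with the two separate lists.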

Upvotes: 1

jay.cs

Reputation: 303

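One option is to split the parsing into two steps:

Step 1: extract the elements.
Step 2: extract the text from each element.

The text attribute of an lxml element is None when the element has no text, so the list comprehension yields a None for the empty price instead of skipping it, and both lists keep the same length.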

import requests
import lxml.html


def parse_data(url):
    try:
        response = requests.get(url)
    except requests.RequestException:  # catch network errors only, not every exception
        return
    tree = lxml.html.document_fromstring(response.text)
    text_aicraft = tree.xpath('//*[contains(@id, "item_card")]/div/div[4]/div/h2/a/text()')
    price_aicraft_elements = tree.xpath('//*[contains(@id, "item_card")]/div/div[4]/div/div[1]')
    price_aicraft = [element.text for element in price_aicraft_elements]
    print(text_aicraft)
    print(len(text_aicraft))
    print(price_aicraft)
    print(len(price_aicraft))


def main():
    url = 'https://www.avbuyer.com/aircraft/private-jets/page-13'
    parse_data(url)


if __name__ == "__main__":
    main()

Output:

['Dassault Falcon 50EX ', 'Cessna Citation M2 ', 'Embraer Phenom 300 ', 'Bombardier Learjet 40XR ', 'Embraer Legacy 600 ', 'Cessna Citation Sovereign ', 'Cessna Citation Ultra ', 'Cessna Citation Ultra ', 'Airbus ACJ318 ', 'Gulfstream G550 ', 'Boeing 737 -500', 'Boeing BBJ ', 'Hawker 800XP ', 'Boeing 737 ', 'Bombardier Learjet 55 ', 'Bombardier Challenger 300 ', 'Airbus ACJ TwoTwenty ', 'Gulfstream G200 ', 'Bombardier Learjet 60XR ', 'Cessna Citation Mustang ']
20
['Deal pending', 'Please call ', 'Please call ', 'Please call ', 'Please call ', 'Price: USD $6,500,000', 'Please call ', 'Please call ', 'Make offer', 'Please call ', 'Please call ', 'Make offer', 'Please call ', 'Price: USD $3,500,000', 'Please email', 'Make offer', None, 'Please call ', 'Deal pending', 'Price: USD $1,200,000']
20
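
Since the question asks for an empty string rather than None, one more pass over the list may be wanted before building the table. A minimal sketch, using three entries copied from the output above; the names clean_titles, clean_prices and table are made up for illustration:

```python
# Values copied from the output above, shortened to three entries.
titles = ['Bombardier Challenger 300 ', 'Airbus ACJ TwoTwenty ', 'Gulfstream G200 ']
prices = ['Make offer', None, 'Please call ']

# Replace None with "" and trim the trailing spaces left in by the page.
clean_prices = [price.strip() if price else "" for price in prices]
clean_titles = [title.strip() for title in titles]

# Both lists have the same length, so zip() pairs them row by row.
table = list(zip(clean_titles, clean_prices))
for title, price in table:
    print("{:<30} {:<20}".format(title, price))
```

This only works because the two lists are already aligned; with the original text() query the missing price would simply be absent and zip() would pair the wrong rows.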

Upvotes: 1
