Reputation: 27
I'm a complete beginner practicing web scraping on tasks from Upwork. My problem: the list of aircraft names comes back complete, but the price list does not, because the new aircraft has no price text. I then need to combine the two lists into a table, and that breaks when the lists have different lengths.
Please help me learn how to handle such cases so that an empty string appears in the price list when the price is missing. Thanks in advance for your answers.
import requests
import lxml.html

def parse_data(url):
    try:
        response = requests.get(url)
    except:
        return
    tree = lxml.html.document_fromstring(response.text)
    text_aicraft = tree.xpath('//*[contains(@id, "item_card")]/div/div[4]/div/h2/a/text()')
    price_aicraft = tree.xpath('//*[contains(@id, "item_card")]/div/div[4]/div/div[1]/text()')
    print(text_aicraft)
    print(len(text_aicraft))
    print(price_aicraft)
    print(len(price_aicraft))

def main():
    url = 'https://www.avbuyer.com/aircraft/private-jets/page-13'
    parse_data(url)

if __name__ == "__main__":
    main()
Upvotes: 1
Views: 105
Reputation: 195643
This script iterates over each item and, if an item has no price, substitutes N/A:
import requests
import lxml.html

def parse_data(url):
    try:
        response = requests.get(url)
    except requests.RequestException:
        # Network or HTTP-layer failure: nothing to parse.
        return
    tree = lxml.html.document_fromstring(response.text)
    # Query each listing separately so a missing price stays tied to its title.
    for item in tree.xpath('//*[contains(@class, "list-item-details")]'):
        title = item.xpath(".//h2/a/text()")[0]
        price = item.xpath('.//*[contains(@class, "price")]/text()')
        # xpath() returns an empty list when nothing matches.
        price = price[0] if price else "N/A"
        print("{:<40} {:<20}".format(title, price))

def main():
    url = "https://www.avbuyer.com/aircraft/private-jets/page-13"
    parse_data(url)

if __name__ == "__main__":
    main()
Prints:
Dassault Falcon 50EX                     Deal pending
Cessna Citation M2                       Please call
Embraer Phenom 300                       Please call
Bombardier Learjet 40XR                  Please call
Embraer Legacy 600                       Please call
Cessna Citation Sovereign                Price: USD $6,500,000
Cessna Citation Ultra                    Please call
Cessna Citation Ultra                    Please call
Airbus ACJ318                            Make offer
Gulfstream G550                          Please call
Boeing 737 -500                          Please call
Boeing BBJ                               Make offer
Hawker 800XP                             Please call
Boeing 737                               Price: USD $3,500,000
Bombardier Learjet 55                    Please email
Bombardier Challenger 300                Make offer
Airbus ACJ TwoTwenty                     N/A
Gulfstream G200                          Please call
Bombardier Learjet 60XR                  Deal pending
Cessna Citation Mustang                  Price: USD $1,200,000
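The question asked for an empty string rather than N/A, and for rows that can go into a table. As a rough sketch, the "first result or default" check can be pulled into a small helper. The name first_text is made up for illustration, and plain lists stand in here for what item.xpath(...) returns, so this runs without fetching the page:

```python
def first_text(results, default=""):
    """Return the first XPath text result, stripped, or `default` if the list is empty."""
    return results[0].strip() if results else default


# Plain lists stand in for the XPath results of two listings:
# one with a price node and one without (assumed sample data).
listings = [
    (["Cessna Citation Mustang "], ["Price: USD $1,200,000"]),
    (["Airbus ACJ TwoTwenty "], []),
]

# One (title, price) tuple per listing; a missing price becomes "".
rows = [(first_text(title), first_text(price)) for title, price in listings]
print(rows)
```

Because each row is built from a single listing, the title and price cannot drift out of step the way two independently collected lists can.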
Upvotes: 1
Reputation: 303
One option is to split the parsing into two steps.
Step 1: extract the elements. Step 2: extract the text from each element.
An element's .text attribute is None when the element contains no text, so the list comprehension puts None in that position and both lists keep the same length.
import requests
import lxml.html

def parse_data(url):
    try:
        response = requests.get(url)
    except requests.RequestException:
        # Network or HTTP-layer failure: nothing to parse.
        return
    tree = lxml.html.document_fromstring(response.text)
    text_aicraft = tree.xpath('//*[contains(@id, "item_card")]/div/div[4]/div/h2/a/text()')
    # Step 1: select the price elements themselves, not their text nodes.
    price_aicraft_elements = tree.xpath('//*[contains(@id, "item_card")]/div/div[4]/div/div[1]')
    # Step 2: read each element's text; it is None when the element is empty.
    price_aicraft = [element.text for element in price_aicraft_elements]
    print(text_aicraft)
    print(len(text_aicraft))
    print(price_aicraft)
    print(len(price_aicraft))

def main():
    url = 'https://www.avbuyer.com/aircraft/private-jets/page-13'
    parse_data(url)

if __name__ == "__main__":
    main()
Output:
['Dassault Falcon 50EX ', 'Cessna Citation M2 ', 'Embraer Phenom 300 ', 'Bombardier Learjet 40XR ', 'Embraer Legacy 600 ', 'Cessna Citation Sovereign ', 'Cessna Citation Ultra ', 'Cessna Citation Ultra ', 'Airbus ACJ318 ', 'Gulfstream G550 ', 'Boeing 737 -500', 'Boeing BBJ ', 'Hawker 800XP ', 'Boeing 737 ', 'Bombardier Learjet 55 ', 'Bombardier Challenger 300 ', 'Airbus ACJ TwoTwenty ', 'Gulfstream G200 ', 'Bombardier Learjet 60XR ', 'Cessna Citation Mustang ']
20
['Deal pending', 'Please call ', 'Please call ', 'Please call ', 'Please call ', 'Price: USD $6,500,000', 'Please call ', 'Please call ', 'Make offer', 'Please call ', 'Please call ', 'Make offer', 'Please call ', 'Price: USD $3,500,000', 'Please email', 'Make offer', None, 'Please call ', 'Deal pending', 'Price: USD $1,200,000']
20
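Since the goal was an empty string and a table, here is a minimal sketch of the follow-up step, assuming two equal-length lists like the ones printed above. The sample values are a three-item excerpt of that output, and the column names aircraft and price are only placeholders:

```python
import csv
import io

# Excerpt of the two lists from the output above (None marks the missing price).
titles = ["Bombardier Challenger 300 ", "Airbus ACJ TwoTwenty ", "Gulfstream G200 "]
prices = ["Make offer", None, "Please call "]

# `value or ""` turns None into an empty string; strip() drops trailing spaces.
rows = [((t or "").strip(), (p or "").strip()) for t, p in zip(titles, prices)]

# Write the rows as CSV into an in-memory buffer (a real file would work the same way).
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["aircraft", "price"])  # placeholder header names
writer.writerows(rows)
print(buffer.getvalue())
```

Note that zip() silently stops at the shorter list, so it is only safe here because the two-step extraction keeps both lists the same length.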
Upvotes: 1