Reputation: 187
I am trying to scrape a site whose link appears in the code below.
The goal is to get the data from the element attributes, since the elements show no text when I inspect the page.
Here is the full XPath of an element:
/html/body/div[2]/div[3]/div/div[3]/section[1]/div/div[2]/div[1]
and the code:
import requests
from lxml import html
page = requests.get('https://www.meilleursagents.com/annonces/achat/nice-06000/appartement/')
tree = html.fromstring(page.content)
Trying to read the value of the 'data-wa-data' attribute with:
tree.xpath('/html/body/div[2]/div[3]/div/div[3]/section[1]/div/div[2]/div[1]/@data-wa-data')
returns an empty list.
The same happens for another element that does contain text:
tree.xpath('/html/body/div[2]/div[3]/div/div[3]/section[1]/div/div[2]/div[1]/div/a/div[1]/text()')
Upvotes: 0
Views: 100
Reputation: 620
The problem is that this website requires a User-Agent header before it returns the complete HTML, and your request does not send one. To get the complete page, pass a user-agent as a header.
Note: This website is aggressive about blocking. In my experience you cannot even make two consecutive requests with the same user-agent. My advice is to rotate proxies and user-agents, and to add a download delay between requests so that you do not hit the server too rapidly.
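The rotation and delay advice above could be sketched roughly as follows. This is only an illustration, not a tested recipe for this site: the helper names `rotating_headers` and `fetch_all`, the user-agent strings in `USER_AGENTS`, and the delay bounds are all assumptions made for the example. The `get` argument is meant to be something like `requests.get`.

```python
import itertools
import random
import time

# Example user-agent strings (assumed for illustration; use your own list).
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:38.0) Gecko/20100101 Firefox/38.0',
    'Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101 Firefox/102.0',
]


def rotating_headers(user_agents):
    """Yield header dicts forever, cycling through the given user-agents."""
    for ua in itertools.cycle(user_agents):
        yield {'user-agent': ua}


def fetch_all(urls, get, user_agents=USER_AGENTS,
              min_delay=2.0, max_delay=5.0, sleep=time.sleep):
    """Call get(url, headers=...) for each URL with a new user-agent each time.

    A random pause between min_delay and max_delay seconds is inserted
    before every request except the first.
    """
    headers_iter = rotating_headers(user_agents)
    responses = []
    for index, url in enumerate(urls):
        if index > 0:
            sleep(random.uniform(min_delay, max_delay))
        responses.append(get(url, headers=next(headers_iter)))
    return responses
```

With requests installed, a call might look like `fetch_all(urls, requests.get)`. Proxy rotation is not shown here; `requests.get` accepts a `proxies` argument, so it could be cycled in the same way if you have proxies available.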
Code
import requests
from lxml import html
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:38.0) Gecko/20100101 Firefox/38.0'
}
page = requests.get('https://www.meilleursagents.com/annonces/achat/nice-06000/appartement/', headers=headers)
tree = html.fromstring(page.content)
print(tree.xpath('//div[@class="listing-item search-listing-result__item"]/@data-wa-data'))
Output
['listing_id=1971029217|realtor_id=21407|source=listings_results', 'listing_id=1971046117|realtor_id=74051|source=listings_results', 'listing_id=1971051280|realtor_id=71648|source=listings_results', 'listing_id=1971053639|realtor_id=21407|source=listings_results', 'listing_id=1971053645|realtor_id=38087|source=listings_results', 'listing_id=1971053650|realtor_id=29634|source=listings_results', 'listing_id=1971053651|realtor_id=29634|source=listings_results', 'listing_id=1971053652|realtor_id=29634|source=listings_results', 'listing_id=1971053656|realtor_id=39588|source=listings_results', 'listing_id=1971053658|realtor_id=39588|source=listings_results', 'listing_id=1971053661|realtor_id=39588|source=listings_results', 'listing_id=1971053662|realtor_id=39588|source=listings_results']
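If you need the individual fields rather than the raw attribute strings, something like the following might help. It is a minimal sketch that assumes every value keeps the `key=value|key=value` shape seen in the output above; `parse_wa_data` is a hypothetical helper name, not part of lxml or requests.

```python
def parse_wa_data(value):
    """Split a 'k1=v1|k2=v2' string into a dict (format assumed from the output)."""
    fields = {}
    for pair in value.split('|'):
        key, sep, val = pair.partition('=')
        if sep:  # skip fragments that contain no '='
            fields[key] = val
    return fields


sample = 'listing_id=1971029217|realtor_id=21407|source=listings_results'
print(parse_wa_data(sample))
# {'listing_id': '1971029217', 'realtor_id': '21407', 'source': 'listings_results'}
```

Applied to the whole result, this would be `[parse_wa_data(v) for v in tree.xpath(...)]` with the same XPath expression as in the code above.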
Upvotes: 1