Reputation: 570
I am new to XPath and am trying to scrape a website with Scrapy. The pages have the format below:
<div class="top">
<a> tittle_name </a>
<div class="middle"> listed_date </div>
<div class="middle"> listed_value </div>
</div>
<div class="top">
<a> tittle_name </a>
<div class="middle"> listed_date </div>
</div>
<div class="top">
<a> tittle_name </a>
<div class="middle"> listed_value </div>
</div>
Both listed_value and listed_date are optional.
I need to group each tittle_name with its respective listed_date and listed_value (if available), then insert each record into MySQL.
I am using the Scrapy shell, which gives some basic examples like:
listings = hxs.select('//div[@class=\'top\']')
for listing in listings:
    tittle_name = listing.select('/a//text()').extract()
    date_values = listing.select('//div[@class=\'middle\']')
The above code gives me a list of tittle_name values and a list of the available listed_date and listed_value entries, but how do I match them? (We cannot go by index because the format is not symmetric.)
Thanks.
Upvotes: 0
Views: 1294
Reputation:
Note that these XPath expressions are absolute, so they are evaluated from the document root rather than from each listing:
/a//text()
//div[@class=\'middle\']
You need relative XPath expressions like these:
a
div[@class=\'middle\']
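To illustrate how relative paths keep each title together with its own values, here is a minimal sketch. It uses Python's standard-library xml.etree.ElementTree as a stand-in for Scrapy's selectors (an assumption made only so the example runs on its own), and it wraps the question's sample in a hypothetical root element so that it parses as XML. The function name group_listings is also made up for this sketch.

```python
import xml.etree.ElementTree as ET

# Sample markup from the question, wrapped in a hypothetical <root> element
# so that it is a single well-formed XML document.
SAMPLE = """
<root>
  <div class="top">
    <a> tittle_name </a>
    <div class="middle"> listed_date </div>
    <div class="middle"> listed_value </div>
  </div>
  <div class="top">
    <a> tittle_name </a>
    <div class="middle"> listed_date </div>
  </div>
  <div class="top">
    <a> tittle_name </a>
    <div class="middle"> listed_value </div>
  </div>
</root>
"""


def group_listings(markup):
    """Return one (title, middle_texts) pair per div.top block."""
    root = ET.fromstring(markup)
    records = []
    for listing in root.findall(".//div[@class='top']"):
        # Relative paths: both lookups are evaluated against this listing only,
        # so the results can never mix values from different listings.
        title = listing.findtext("a", default="").strip()
        middles = [
            div.text.strip()
            for div in listing.findall("div[@class='middle']")
            if div.text
        ]
        records.append((title, middles))
    return records


records = group_listings(SAMPLE)
for title, middles in records:
    print(title, middles)
```

Each record carries only the middle values that belong to its own listing, so no index matching across separate lists is needed.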
Second, it's not a good idea to select text nodes in a mixed content model like (X)HTML. You should extract the string value with the proper DOM method or with the string() function. (In the latter case, you would need to evaluate the expression once per node, because string() applied to a node set implicitly uses only its first node.)
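As a rough illustration of the difference between a single text node and the string value, here is a small sketch. It again assumes Python's standard-library xml.etree.ElementTree rather than Scrapy's selectors, and the link with nested markup is invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical mixed-content link: text interleaved with a child element.
link = ET.fromstring("<a>tittle <b>name</b> (new)</a>")

# Only the text before the first child element, i.e. a single text node.
first_text_node = link.text

# String value: all descendant text concatenated in document order.
string_value = "".join(link.itertext())

print(repr(first_text_node))
print(repr(string_value))
```

With mixed content, the single text node holds only part of the visible title, while the string value holds all of it.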
Upvotes: 1
Reputation: 9437
Well, since the website doesn't mark whether the content of a div[@class='middle'] is a date or a value, you'll have to write your own way of deciding this.
I guess the dates have some specific format that you could match with some analysis, maybe using a regular expression.
Can you be more specific about the possible values for listed_date and listed_value?
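Until the real formats are known, the idea can only be sketched under assumptions. The example below assumes dates look like YYYY-MM-DD (purely hypothetical, the actual site may differ) and treats any other non-empty text as the value. The names DATE_RE and classify_middles are made up for this sketch.

```python
import re

# Assumption: dates look like YYYY-MM-DD. The real site's format is unknown,
# so this pattern would need to be adapted.
DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2}$")


def classify_middles(texts):
    """Split the div.middle texts of one listing into (listed_date, listed_value).

    Either element of the returned pair is None when it is absent.
    """
    listed_date = None
    listed_value = None
    for text in texts:
        text = text.strip()
        if DATE_RE.match(text):
            listed_date = text
        elif text:
            # Anything that is not a date is treated as the value.
            listed_value = text
    return listed_date, listed_value


print(classify_middles([" 2012-03-01 ", " 1500 "]))
print(classify_middles([" 1500 "]))
print(classify_middles([" 2012-03-01 "]))
```

Applied to the middle texts of a single listing, this yields one (listed_date, listed_value) pair per title, with None standing in for whichever field is missing.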
Upvotes: 0