Harry

Reputation: 570

Scrapy, Python, XPath: how to match respective items in HTML

I am new to XPath and am trying to scrape a website with the format below:

<div class="top">
    <a> tittle_name </a>
    <div class="middle"> listed_date </div>
    <div class="middle"> listed_value </div>
</div>
<div class="top">
    <a> tittle_name </a>
    <div class="middle"> listed_date </div>
</div>
<div class="top">
    <a> tittle_name </a>
    <div class="middle"> listed_value </div>
</div>

Both listed_value and listed_date are optional.

I need to group each tittle_name with its respective listed_date and listed_value (if available), then insert each record into MySQL.

I am using the Scrapy shell, which gives some basic examples like:

listings = hxs.select('//div[@class=\'top\']')
for listing in listings:
    tittle_name = listing.select('/a//text()').extract()
    date_values = listing.select('//div[@class=\'middle\']')

The code above gives me a list of tittle_name values and a list of the available listed_date and listed_value entries, but how do I match them? (We cannot go by index because the format is not symmetric.)

Thanks.

Upvotes: 0

Views: 1294

Answers (2)

user357812

Reputation:

Do note that those XPath expressions are absolute:

/a//text()

//div[@class='middle']

You would need relative XPath expressions like these, evaluated against each listing node:

a

div[@class='middle']

Second, it's not a good idea to select text nodes in a mixed content model like (X)HTML. You should extract the string value with the proper DOM method or with the string() function. (In the latter case, you would need to evaluate the expression for each node, because string() applied to a node set implicitly uses only the first node of that set.)
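To illustrate both points, here is a minimal sketch using Python's standard-library ElementTree rather than Scrapy, so that it runs without extra dependencies. The sample markup from the question is wrapped in a hypothetical <root> element so it parses as XML, and group_listings is a made-up helper name. In Scrapy, the same relative expressions (a and div[@class='middle']) would be passed to listing.select(...) inside the loop.

```python
import xml.etree.ElementTree as ET

# Sample markup from the question, wrapped in a <root> element
# (an assumption made here so the fragment parses as XML).
SAMPLE = """<root>
<div class="top">
    <a> tittle_name </a>
    <div class="middle"> listed_date </div>
    <div class="middle"> listed_value </div>
</div>
<div class="top">
    <a> tittle_name </a>
    <div class="middle"> listed_date </div>
</div>
<div class="top">
    <a> tittle_name </a>
    <div class="middle"> listed_value </div>
</div>
</root>"""


def string_value(node):
    # Concatenate all descendant text, similar to XPath string(),
    # instead of selecting individual text nodes.
    return "".join(node.itertext()).strip()


def group_listings(markup):
    root = ET.fromstring(markup)
    records = []
    for listing in root.findall(".//div[@class='top']"):
        # Relative paths: evaluated against this listing only.
        link = listing.find("a")
        title = string_value(link) if link is not None else None
        middles = [string_value(d)
                   for d in listing.findall("div[@class='middle']")]
        records.append({"title": title, "middle": middles})
    return records


records = group_listings(SAMPLE)
```

Each record keeps only the middle entries found under its own div, so a listing with a missing date or value simply ends up with a shorter list.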

Upvotes: 1

Ptival

Reputation: 9437

Well, since the website doesn't specify whether something in a div[@class='middle'] is a date or a value, you'll have to code your own way of deciding this.

I guess the dates have some specific format that you could match with some analysis, maybe using a regular expression.
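As a rough sketch of that idea, assuming (purely for illustration) that dates look like YYYY-MM-DD and that anything else is a value, the texts taken from one listing could be classified like this. DATE_RE and classify are hypothetical names, and the pattern would need to be adapted to the format the site actually uses.

```python
import re

# Assumption: dates are formatted as YYYY-MM-DD. Adjust to the real format.
DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2}$")


def classify(texts):
    """Split the texts of one listing into a date and a value.

    Any text matching DATE_RE is treated as the date; any other
    text is treated as the value. Missing fields stay None.
    """
    record = {"listed_date": None, "listed_value": None}
    for text in texts:
        text = text.strip()
        if DATE_RE.match(text):
            record["listed_date"] = text
        else:
            record["listed_value"] = text
    return record
```

For example, classify(["2011-03-01", "1500"]) would fill both fields, while classify(["1500"]) would leave listed_date as None, which maps naturally to a NULL column in MySQL.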

Could you be more specific about the possible values for listed_date and listed_value?

Upvotes: 0
