merlin

Reputation: 2897

How to get only one element per tag in XPath?

I am trying to extract attributes from a website, but get empty elements.

Using this code within the Scrapy shell:

fetch('https://www.chronext.de/breitling/galactic/w7234812-a785-249s-a12d-4/C79467')

from w3lib.html import remove_tags
[remove_tags(w).strip() for w in response.xpath('//table[@class="compact margin-top-half"][1]/tr/td[2]/text()').extract()]

I am getting:

['C77316', '279175', 'Damen', 'Automatik', '28\xa0mm', 'Roségold', 'Roségold', 'Saphirglas', '', '', '', '2018', 'Originale Box', 'Originale Hersteller Papiere', 'CHRONEXT Echtheitszertifikat', 'Zusätzlich zur Herstellergarantie erhalten Sie eine 2-jährige CHRONEXT Garantie ab Kaufdatum.']

Which is surprising, as I aimed for the second box with /div[2] but received elements from both boxes instead.

I also tried this:

[x.strip() for x in response.xpath('//div[@class="row force-inside-container-behavior"]/div[2]/table/tr/td[2]/text()').extract()]

which returns this:

['', '', '', '2018', 'Originale Box', 'Originale Hersteller Papiere', 'CHRONEXT Echtheitszertifikat', 'Zusätzlich zur Herstellergarantie erhalten Sie eine 2-jährige CHRONEXT Garantie ab Kaufdatum.']

My goal is to get a dictionary of key/value pairs, e.g. "condition": "good". The first box worked without problems, so I thought I would get the second box separately and extend the list.

The keys are not the problem. The values, however, come back with those 3-4 empty elements, which will put the order out of sync once I combine keys and values later on. Removing the 3 empty fields might not be a good option, as another page on this site might be slightly different.

How can I get only one element per key-value?

Upvotes: 0

Views: 75

Answers (1)

Umair Ayub

Reputation: 21201

You want to extract those specifications?

This code extracts the key-value pairs from the first specs table on that page:

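specs = {}
# Take the first specifications column and walk its table rows.
for row in response.css(".specifications .col.s12.l5")[0].css("tr"):
    cells = row.css("td")
    # First cell is the key, second the value (first text node of each).
    specs[cells[0].css("::text").extract_first()] = cells[1].css("::text").extract_first()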

{u'Uhr f\xfcr': u'Damen', u'Glas': u'Saphirglas', u'Artikel\xadnummer': u'C79467', u'Gr\xf6\xdfe (Geh\xe4use)': u'29\xa0mm', u'Material (Geh\xe4use)': u'Edelstahl', u'Werk': u'Quarz', u'Armband': u'Kautschuk', u'Referenz': u'W7234812.A785.249S.A12D.4'}
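The reason this stays in sync is that key and value are read from the same row, so a cell whose text sits inside a nested element (or a cell with no text at all) cannot shift the later pairs. Below is a minimal sketch of that idea, not the exact code for this site. It uses Python's standard-library ElementTree instead of Scrapy so it runs on its own; the markup, `cell_text` and `table_to_dict` are made up for illustration. In Scrapy, something like `row.xpath('string(td[2])').extract_first()` inside the loop should give the same "whole cell as one string" behaviour.

```python
import xml.etree.ElementTree as ET

# Made-up markup standing in for one specifications table; the real page differs.
SAMPLE = """
<table>
  <tr><td>Referenz</td><td>W7234812</td></tr>
  <tr><td>Lieferumfang</td><td> <span>Originale Box</span> </td></tr>
  <tr><td>Garantie</td><td></td></tr>
</table>
"""


def cell_text(cell):
    # Join every text node inside the cell (nested elements included)
    # and collapse runs of whitespace into single spaces.
    return " ".join("".join(cell.itertext()).split())


def table_to_dict(table):
    specs = {}
    for row in table.findall("tr"):
        cells = row.findall("td")
        if len(cells) < 2:
            continue  # skip rows that do not hold a key/value pair
        # Exactly one key and one value per row, even if the value is empty.
        specs[cell_text(cells[0])] = cell_text(cells[1])
    return specs


specs = table_to_dict(ET.fromstring(SAMPLE.strip()))
print(specs)
```

An empty cell shows up as an empty string under its own key, so nothing has to be filtered out afterwards and pages with slightly different rows still produce matching pairs.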

Upvotes: 1
