Amen Aziz

Reputation: 779

Scrape specific text from table

from scrapy import Spider
from scrapy.http import Request


class AuthorSpider(Spider):
    name = 'book'
    start_urls = ['https://www.amazon.sg/s?k=Measuring+Tools+%26+Scales&i=home&crid=1011S67HHJSEW&sprefix=measuring+tools+%26+scales%2Chome%2C408&ref=nb_sb_noss']

    def parse(self, response):
        books = response.xpath("//h2/a/@href").extract()
        for book in books:
            url = response.urljoin(book)
            yield Request(url, callback=self.parse_book)

    def parse_book(self, response):
        rows = response.xpath('//table[@id="productDetails_techSpec_section_1"]//tr')
        table={}
        for row in rows:
            brand = row.xpath("//th[@class='a-color-secondary a-size-base prodDetSectionEntry' and contains(text(), 'Brand')]/following-sibling::td/text()").get()
            asin = row.xpath("//th[@class='a-color-secondary a-size-base prodDetSectionEntry' and contains(text(), 'ASIN')]/following-sibling::td/text()").get().replace('\u200e',"")
            table.update({'Brand':brand,'Asin':asin})
        yield table

I want to scrape only the Brand and ASIN from the table. I am scraping the text from the product information section of the product page. This is the link: https://www.amazon.sg/Etekcity-Accurate-Measuring-Packages-Stainless/dp/B08BPB9T1N/ref=sr_1_1?crid=1011S67HHJSEW&keywords=Measuring%2BTools%2B%26%2BScales&qid=1643125635&s=home&sprefix=measuring%2Btools%2B%26%2Bscales%2Chome%2C408&sr=1-1&th=1

Upvotes: 0

Views: 99

Answers (1)

mr_mooo_cow

Reputation: 1128

If you just need the brand and ASIN, you don't need to iterate through the whole table. You can use XPath to select those cells directly. One way to do it is the following:

brand = response.xpath("//th[@class='a-color-secondary a-size-base prodDetSectionEntry' and contains(text(), 'Brand')]/following-sibling::td/text()").get()

asin = response.xpath("//th[@class='a-color-secondary a-size-base prodDetSectionEntry' and contains(text(), 'ASIN')]/following-sibling::td/text()").get()

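You might need to clean up the resulting text a bit, for example with str.strip() (and, as in the question's code, .replace('\u200e', "") to drop the invisible left-to-right mark). Keep in mind that .get() returns None when nothing matches, so check for that before calling string methods. All this XPath is saying is: "find the th tag with the right class whose text contains 'Brand' or 'ASIN', then look ahead to the following td sibling and grab that text."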
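To make the cleanup step concrete, here is a minimal sketch of how the two lookups and the text cleanup could be wrapped up. The helper names (clean_cell, cell_xpath, extract_brand_asin) are made up for illustration and are not part of Scrapy. It assumes `response` is a Scrapy response (anything with .xpath(...).get() works) and that the class names on the page are still the ones used above. Stripping '\u200f' as well as '\u200e' is an assumption; the question only mentions '\u200e'.

```python
from typing import Optional

# Invisible direction marks that can end up in the cell text.
# '\u200e' comes from the question's code; '\u200f' is an assumption.
_DIRECTION_MARKS = ("\u200e", "\u200f")

# Class attribute of the <th> cells, copied from the XPath in the answer.
TH_CLASS = "a-color-secondary a-size-base prodDetSectionEntry"


def clean_cell(text: Optional[str]) -> Optional[str]:
    """Remove direction marks and surrounding whitespace from a cell.

    .get() returns None when the XPath matches nothing, so None is
    passed through instead of raising AttributeError.
    """
    if text is None:
        return None
    for mark in _DIRECTION_MARKS:
        text = text.replace(mark, "")
    return text.strip()


def cell_xpath(label: str) -> str:
    """Build the XPath for the <td> text next to the <th> containing label."""
    return (
        f"//th[@class='{TH_CLASS}' and contains(text(), '{label}')]"
        "/following-sibling::td/text()"
    )


def extract_brand_asin(response) -> dict:
    """Hypothetical helper: pull Brand and ASIN from a product page response."""
    return {
        "Brand": clean_cell(response.xpath(cell_xpath("Brand")).get()),
        "Asin": clean_cell(response.xpath(cell_xpath("ASIN")).get()),
    }
```

With this in place, parse_book in the spider could be reduced to a single line, `yield extract_brand_asin(response)`, with no loop over the table rows.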

Upvotes: 1
