GamerDD

Reputation: 1

Scrapy Scrape Table Data

I'm a beginner at coding. I started with Python and Scrapy, and this is my first code.

I'm running into the following issue: the table I am scraping does not come back as columns with a header/index, but as one flat list of strings. Because each page has a different number of columns and rows, it is difficult to split everything up afterwards in a .CSV or JSON file, as the attributes get mixed up.

Examples: https://www.kavalier.cz/en/lab-burners-sp292.html

Columns:

Code
Type
Pressure (Pa)
Consumption (Nm3/h)
Output (W)
Weight (g)

https://www.kavalier.cz/en/desiccator-with-glass-knob-sp94.html

Columns:

Code
Number
Type
d1 (mm)
d2 (mm)
h (mm)
Packing (pc)

# Open each product page
def parse(self, response):
    urls = response.css('a.btn.btn-default::attr(href)').extract()
    for url in urls:
        url = response.urljoin(url)
        yield scrapy.Request(url=url, callback=self.parse_details)

    # Pagination
    next_page_url = response.css('a.page-link.next::attr(href)').extract_first()
    if next_page_url:
        next_page_url = response.urljoin(next_page_url)
        yield scrapy.Request(url=next_page_url, callback=self.parse)

# Product details
def parse_details(self, response):
    yield {
        'Product_Name': response.css('.content > h2::text').extract_first(),
        'Category': response.css('.breadcrumb > li:nth-child(4) > a ::text').extract_first(),
        'Image_Url': response.css('.main-img > a::attr(href)').extract_first(),
        # Every cell of every row ends up in one flat list here
        'Table': response.xpath('//tr/td/text()').extract(),
    }

How can I adjust my code so that the variable table headers on each page are picked up and used as columns, together with their data?

Upvotes: 0

Views: 1513

Answers (1)

Jai

Reputation: 849

I am assuming that you are trying to scrape table data from a website. In that case, you can use the code below. pandas.read_html returns a list of DataFrames, one per table found on the page, so you do not have to split the cells up yourself.

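import io

import pandas as pd
import requests

url = 'https://www.kavalier.cz/en/desiccator-with-glass-knob-sp94.html'
html = requests.get(url).text
# Wrap the HTML in StringIO; newer pandas versions warn when given a literal string
df_list = pd.read_html(io.StringIO(html))
# read_html returns one DataFrame per table; the last one is used here
df = df_list[-1]
print(df)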
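
If you would rather stay inside the Scrapy spider, one possible approach is to read the header cells and the data rows separately and pair them up per page, so each row becomes a dict keyed by that page's own column names. The sketch below only shows the pairing step. The helper name `rows_to_records` is hypothetical, the sample values are placeholders rather than data from the site, and the XPath expressions in the comments are assumptions about the page markup that I have not checked against the actual pages.

```python
def rows_to_records(headers, rows):
    """Pair every row's cells with that table's own header names."""
    clean_headers = [h.strip() for h in headers]
    records = []
    for cells in rows:
        values = [c.strip() for c in cells]
        # zip() stops at the shorter sequence, so a short row
        # simply omits the trailing columns.
        records.append(dict(zip(clean_headers, values)))
    return records


# Placeholder values for illustration only; not taken from the site.
headers = ["Code", "Type", "Weight (g)"]
rows = [["code-1", "type-a", "100"], ["code-2", "type-b", "200"]]
records = rows_to_records(headers, rows)
print(records)

# Possible use in parse_details (selectors are assumptions about the markup):
#   table_rows = response.xpath('//table//tr')
#   headers = table_rows[0].xpath('./th//text() | ./td//text()').getall()
#   rows = [r.xpath('./td//text()').getall() for r in table_rows[1:]]
#   yield {..., 'Table': rows_to_records(headers, rows)}
```

Because the headers are read from each page, pages with different columns produce dicts with different keys, which JSON output handles naturally. One caveat: empty cells yield no text node, so a row with blanks could shift values into the wrong column unless you extract per cell.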

Upvotes: 1
