Reputation: 1
I'm new to coding. I started with Python and Scrapy, and this is my first code.
I'm running into the following issue: the table I am scraping does not come out formatted in columns with a header/index, but as one flat list of strings. Because each page has a variable number of columns and rows, it gets difficult to split everything up afterwards into a .CSV or JSON file, as the attributes get mixed up.
Examples: https://www.kavalier.cz/en/lab-burners-sp292.html
Columns:
Code
Type
Pressure (Pa)
Consumption (Nm3/h)
Output (W)
Weight (g)
https://www.kavalier.cz/en/desiccator-with-glass-knob-sp94.html
Columns:
Code
Number
Type
d1 (mm)
d2 (mm)
h (mm)
Packing (pc)
# Open product page
def parse(self, response):
    urls = response.css('a.btn.btn-default::attr(href)').extract()
    for url in urls:
        url = response.urljoin(url)
        yield scrapy.Request(url=url, callback=self.parse_details)

    # Pagination
    next_page_url = response.css('a.page-link.next::attr(href)').extract_first()
    if next_page_url:
        next_page_url = response.urljoin(next_page_url)
        yield scrapy.Request(url=next_page_url, callback=self.parse)

# Product details
def parse_details(self, response):
    yield {
        'Product_Name': response.css('.content > h2::text').extract_first(),
        'Category': response.css('.breadcrumb > li:nth-child(4) > a ::text').extract_first(),
        'Image_Url': response.css('.main-img > a::attr(href)').extract_first(),
        'Table': response.xpath('//tr/td/text()').extract(),
    }
How can I adjust my code so that the variable table headers are detected and turned into columns, together with their data?
Upvotes: 0
Views: 1513
Reputation: 849
I am assuming that you are trying to scrape table data from a website. In that case you can use the code below: pandas.read_html returns a list of DataFrames, one per table it finds on the page, and the code takes the last one.
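from io import StringIO

import pandas as pd
import requests

url = 'https://www.kavalier.cz/en/desiccator-with-glass-knob-sp94.html'
html = requests.get(url).text
# Wrap the HTML in StringIO: newer pandas versions deprecate passing raw HTML text
df_list = pd.read_html(StringIO(html))
df = df_list[-1]  # last table found on the page
print(df)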
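If you would rather stay inside Scrapy, the same idea can be sketched by pairing every data row with the header cells of its own page, so each page keeps its own column names. This is only a minimal sketch: `table_to_records` is a hypothetical helper name, the sample values are made up, and the XPath expressions in the comments are assumptions about the page markup that I have not checked against the site.

```python
def table_to_records(header_cells, body_rows):
    """Pair each data row with the header cells of the same table.

    header_cells: list of header strings, e.g. ["Code", "Type"]
    body_rows:    list of rows, each row a list of cell strings
    """
    headers = [h.strip() for h in header_cells]
    records = []
    for row in body_rows:
        cells = [c.strip() for c in row]
        # Pad short rows with empty strings so every header gets a value;
        # cells beyond the last header are dropped by zip().
        cells += [""] * (len(headers) - len(cells))
        records.append(dict(zip(headers, cells)))
    return records


# Hypothetical sample data, only to show the shape of the result.
header = ["Code", "Type", "Pressure (Pa)"]
rows = [["X-1", "A", "100"], ["X-2", "B"]]
records = table_to_records(header, rows)
print(records)

# Inside parse_details the inputs might be collected like this
# (assumes the first <tr> of the table holds the column names):
#   trs = response.xpath('//table//tr')
#   header = trs[0].xpath('./th//text() | ./td//text()').getall()
#   body = [tr.xpath('./td//text()').getall() for tr in trs[1:]]
#   for record in table_to_records(header, body):
#       yield record
```

Because every record is a dict keyed by its own page's headers, a JSON export keeps the attributes apart even when the pages have different columns. For CSV you would still need one combined list of column names across all pages.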
Upvotes: 1