SirAchesis

Reputation: 345

Using Scrapy to scrape h2 tags inside a specific class or style

I'm trying to scrape this website https://www.tahko.com/fi/tapahtumat/. I've been able to scrape the events on the main table, but I now need to scrape the months corresponding to each table.

The months (e.g. Lokakuu 2020 or Marraskuu 2020) are inside h2 tags with the style "font-size:32px;", and those tags sit inside an element with the class "col-lg-8 col-md-8 col-sm-12 col-xs-12" (which is the entire td area).

Here's the HTML. It is placed inside a div with the above-mentioned class.

<h2 style="font-size:32px;">LOKAKUU 2020</h2>

How can I scrape these months?

What I've tried so far is:

fetch("https://www.tahko.com/fi/tapahtumat/")

full = response.xpath('//*[@class="col-lg-8 col-md-8 col-sm-12 col-xs-12"]')

months = full.xpath('/*[@style="font-size:32px;"]')

Bonus question: What would be the easiest way to match these months up to the event tables below them?

Upvotes: 1

Views: 507

Answers (1)

baduker

Reputation: 20052

I didn't want to set up an entire Scrapy project, but this should get you started, I hope.

import requests
from lxml import html

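# Text of the month headings above each event table
header_month_xpath = '//*[@style="font-size:32px;"]/text()'
# Text of the month links in the sidebar widget
month_widget_xpath = '//*[@class="widget"]/a/text()'

# Download the page and parse it once, then reuse the tree for both queries
page = requests.get("https://www.tahko.com/fi/tapahtumat/").text
tree = html.fromstring(page)

print(tree.xpath(header_month_xpath))
print(tree.xpath(month_widget_xpath))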

Output:

['LOKAKUU 2020', 'MARRASKUU 2020', 'JOULUKUU 2020']
['Kaikki menovinkit', 'Tammikuu 2021', 'Helmikuu 2021', 'Maaliskuu 2021', 'Huhtikuu 2021', 'Toukokuu 2021', 'Kesäkuu 2021', 'Heinäkuu 2021', 'Elokuu 2021', 'Syyskuu 2021', 'Lokakuu 2020', 'Marraskuu 2020', 'Joulukuu 2020']
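
As for the bonus question, here is a rough sketch of one way to pair each month with the events below it. It is not tested against the live site: the SAMPLE markup is made up for illustration, and it assumes each month's h2 is immediately followed by a sibling table holding that month's events, which may not match the real page. The function name events_by_month is mine as well. It also shows the relative form of the XPath from the question: a path starting with '/' is evaluated from the document root even when called on a sub-selection, so full.xpath('/*[...]') most likely comes back empty, whereas './/h2[...]' searches inside the selected container.

```python
from lxml import html

# Hypothetical markup for illustration only; the real page may be nested differently.
SAMPLE = """
<div class="col-lg-8 col-md-8 col-sm-12 col-xs-12">
  <h2 style="font-size:32px;">LOKAKUU 2020</h2>
  <table><tr><td>Event A</td></tr><tr><td>Event B</td></tr></table>
  <h2 style="font-size:32px;">MARRASKUU 2020</h2>
  <table><tr><td>Event C</td></tr></table>
</div>
"""

container_xpath = '//div[@class="col-lg-8 col-md-8 col-sm-12 col-xs-12"]'
# Leading "." keeps the search relative to the container element
heading_xpath = './/h2[@style="font-size:32px;"]'


def events_by_month(page_source):
    """Map each month heading to the cell texts of the table right after it."""
    tree = html.fromstring(page_source)
    result = {}
    for container in tree.xpath(container_xpath):
        for heading in container.xpath(heading_xpath):
            month = heading.text_content().strip()
            # Assumption: the month's events live in the next sibling table
            tables = heading.xpath('following-sibling::table[1]')
            cells = tables[0].xpath('.//td') if tables else []
            result[month] = [cell.text_content().strip() for cell in cells]
    return result


print(events_by_month(SAMPLE))
```

The same XPath expressions should carry over to Scrapy's response.xpath(), since both use XPath 1.0. If the tables turn out not to be siblings of the headings, the following-sibling step is the part to adjust.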

Upvotes: 1
