Reputation: 345
I'm trying to scrape this website https://www.tahko.com/fi/tapahtumat/. I've been able to scrape the events on the main table, but I now need to scrape the months corresponding to each table.
The months (e.g. Lokakuu 2020 or Marraskuu 2020) are in h2 tags with the style "font-size:32px;", inside an element with the class "col-lg-8 col-md-8 col-sm-12 col-xs-12" (which covers the entire td area).
Here's the HTML code. This is placed inside a div, with the above-mentioned class.
<h2 style="font-size:32px;">LOKAKUU 2020</h2>
How can I scrape these months?
What I've tried so far is:
fetch("https://www.tahko.com/fi/tapahtumat/")
full = response.xpath('//*[@class="col-lg-8 col-md-8 col-sm-12 col-xs-12"]')
months = full.xpath('/*[@style="font-size:32px;"]')
Bonus question: What would be the easiest way to match these months up to the event tables below them?
Upvotes: 1
Views: 507
Reputation: 20052
I didn't want to set up an entire Scrapy project, but I hope this gets you started.
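import requests
from lxml import html
# Text of the month headings (elements styled with font-size:32px).
header_month_xpath = '//*[@style="font-size:32px;"]/text()'
# Text of the month links inside the element with class "widget".
month_widget_xpath = '//*[@class="widget"]/a/text()'
# Fetch the page and parse it once, then reuse the tree for both queries.
page = requests.get("https://www.tahko.com/fi/tapahtumat/").text
tree = html.fromstring(page)
print(tree.xpath(header_month_xpath))
print(tree.xpath(month_widget_xpath))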
Output:
['LOKAKUU 2020', 'MARRASKUU 2020', 'JOULUKUU 2020']
['Kaikki menovinkit', 'Tammikuu 2021', 'Helmikuu 2021', 'Maaliskuu 2021', 'Huhtikuu 2021', 'Toukokuu 2021', 'Kesäkuu 2021', 'Heinäkuu 2021', 'Elokuu 2021', 'Syyskuu 2021', 'Lokakuu 2020', 'Marraskuu 2020', 'Joulukuu 2020']
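For the bonus question, one possible approach is to walk the month headings and the tables in document order and attach each table's rows to the most recent heading. This is only a sketch: the SAMPLE markup and the events_by_month helper are hypothetical, and it assumes each month's table follows its h2 in the page source, which I have not verified against the live site.

```python
from lxml import html

# Hypothetical, simplified stand-in for the page markup (an assumption,
# not copied from the real site).
SAMPLE = """
<div class="col-lg-8 col-md-8 col-sm-12 col-xs-12">
  <h2 style="font-size:32px;">LOKAKUU 2020</h2>
  <table><tr><td>Event A</td></tr><tr><td>Event B</td></tr></table>
  <h2 style="font-size:32px;">MARRASKUU 2020</h2>
  <table><tr><td>Event C</td></tr></table>
</div>
"""

def events_by_month(tree):
    """Group table rows under the month heading that precedes them."""
    grouped = {}
    current = None
    # An XPath union returns nodes in document order, so headings and
    # tables come back interleaved exactly as they appear on the page.
    for node in tree.xpath('//h2[@style="font-size:32px;"] | //table'):
        if node.tag == "h2":
            current = node.text_content().strip()
            grouped[current] = []
        elif current is not None:
            for row in node.xpath(".//tr"):
                # Collapse whitespace so each row becomes one clean string.
                grouped[current].append(" ".join(row.text_content().split()))
    return grouped

result = events_by_month(html.fromstring(SAMPLE))
print(result)
```

To try it on the real page, pass html.fromstring(page) from the snippet above instead of SAMPLE. If the page contains other tables before the first month heading, they are skipped because no heading has been seen yet.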
Upvotes: 1