Reputation: 556
I'm a complete beginner in JavaScript and in web scraping with Puppeteer, and I am trying to get the scores of a single EuroLeague round from
https://www.euroleague.net/main/results?gamenumber=28&phasetypecode=RS&seasoncode=E2019
By inspecting the score list on that page, I found that it is a div element containing other divs, with the stats displayed inside them.
HTML for a single match between two teams (there are more divs for the other matches below this example):
<!-- score list -->
<div class="wp-module wp-module-asidegames wp-module-5lfarqnjesnirthi">
  <!-- the data-code increases to "euro_245" ... -->
  <div class="">
    <div class="game played" data-code="euro_244" data-date="1583427600000" data-played="1">
      <a href="/main/results/showgame?gamecode=244&seasoncode=E2019" class="game-link">
        <div class="club">
          <span class="name">Zenit St Petersburg</span>
          <span class="score homepts winner">76</span>
        </div>
        <div class="club">
          <span class="name">Zalgiris Kaunas</span>
          <span class="score awaypts ">75</span>
        </div>
        <div class="info">
          <span class="date">March 5 18:00 CET</span>
          <span class="live">
            LIVE <span class="minute"></span>
          </span>
          <span class="final">
            FINAL
          </span>
        </div>
      </a>
    </div>
    <!-- more teams -->
  </div>
</div>
What I want is to iterate through the outer div element, get the teams and the score of each match, and store them in a JSON file. However, since I am a complete beginner, I do not understand how to iterate through the HTML above. This is my web scraping code to get the element:
const puppeteer = require('puppeteer');
const sleep = (delay) => new Promise((resolve) => setTimeout(resolve, delay));

async function getTeams(url) {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(url);
  await sleep(3000);
  const games = await page.$x('//*[@id="main-one"]/div/div/div/div[1]/div[1]/div[3]');
  // this is where I will execute the iteration part to get the matches with their scores
  await sleep(2000);
  await browser.close();
}

getTeams('https://www.euroleague.net/main/results?gamenumber=28&phasetypecode=RS&seasoncode=E2019');
I would appreciate your help in guiding me through the iteration part. Thank you in advance.
Upvotes: 1
Views: 726
Reputation: 8841
The most accurate selector for a game box is div.game.played (a div that has both the .game and the .played CSS classes). You first need to count the elements that match this selector. This is possible with page.$$eval (page.$$eval(selector, pageFunction[, ...args])), which runs Array.from(document.querySelectorAll(selector)) within the page and passes the result as the first argument to pageFunction.
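As a rough sketch of what pageFunction receives: the two helpers below (countGames and readDataCodes are names made up for this illustration, not part of Puppeteer) each take an array of elements, the way page.$$eval would hand it over. The stand-in objects at the end only mimic getAttribute so the helpers can be tried outside a browser.

```javascript
// Hypothetical pageFunctions: each receives the array that page.$$eval
// builds from document.querySelectorAll(selector) inside the page.
const countGames = (els) => els.length;
const readDataCodes = (els) => els.map((el) => el.getAttribute('data-code'));

// Intended use inside an async function with an open Puppeteer `page`:
//   const total = await page.$$eval('div.game.played', countGames);
//   const codes = await page.$$eval('div.game.played', readDataCodes);

// Stand-in elements (an assumption for demonstration, not real DOM nodes).
const fakeEls = [
  { getAttribute: (name) => (name === 'data-code' ? 'euro_244' : null) },
  { getAttribute: (name) => (name === 'data-code' ? 'euro_245' : null) },
];
console.log(countGames(fakeEls), readDataCodes(fakeEls));
```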
Because we use the element index to pick the specific data fields, we run a regular for loop up to the number of matched elements.
If you need a specific range of "euro_xyz" games, you can read the data-code attribute values inside page.evaluate with Element.getAttribute and compare their numeric part against the desired "xyz" number.
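The range check on the data-code values could look roughly like this. gameNumber and codesInRange are hypothetical helper names, and the sketch assumes every code has the "euro_" prefix shown in the question's HTML.

```javascript
// Hypothetical helper: turn "euro_244" into the number 244.
function gameNumber(dataCode) {
  return parseInt(dataCode.replace('euro_', ''), 10);
}

// Hypothetical helper: keep only the codes whose number lies in [min, max].
function codesInRange(dataCodes, min, max) {
  return dataCodes.filter((code) => {
    const n = gameNumber(code);
    return n >= min && n <= max;
  });
}

console.log(codesInRange(['euro_243', 'euro_244', 'euro_245'], 244, 245));
```

The input array here is made up; in the scraper it would come from the data-code attributes read in the page.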
To collect each game's data we define a collector array (gameObj) that is extended on each iteration. In each iteration we fill an actualGame object with that game's data.
It is important to determine which child elements contain the corresponding values. For example, the home club's name is at 'div.game.played > a > div:nth-child(1) > span:nth-child(1)': the div child number selects the club, while the span child number selects between the club name and the points. The loop's [i] index grabs the values from the right game box (that is why the elements were counted at the beginning).
For example:
const allGames = await page.$$('div.game.played')
const allGameLength = await page.$$eval('div.game.played', el => el.length)
const gameObj = [] // collector array for all games
for (let i = 0; i < allGameLength; i++) {
  try {
    // read the numeric part of data-code, e.g. "euro_244" -> 244
    let dataCode = await page.evaluate(el => el.getAttribute('data-code'), allGames[i])
    dataCode = parseInt(dataCode.replace('euro_', ''), 10)
    if (dataCode > 243) { // keep only games euro_244 and later
      const actualGame = {
        homeClub: await page.evaluate(el => el.textContent, (await page.$$('div.game.played > a > div:nth-child(1) > span:nth-child(1)'))[i]),
        awayClub: await page.evaluate(el => el.textContent, (await page.$$('div.game.played > a > div:nth-child(2) > span:nth-child(1)'))[i]),
        homePoints: await page.evaluate(el => el.textContent, (await page.$$('div.game.played > a > div:nth-child(1) > span:nth-child(2)'))[i]),
        awayPoints: await page.evaluate(el => el.textContent, (await page.$$('div.game.played > a > div:nth-child(2) > span:nth-child(2)'))[i]),
        gameDate: await page.evaluate(el => el.textContent, (await page.$$('div.game.played > a > div:nth-child(3) > span:nth-child(1)'))[i])
      }
      gameObj.push(actualGame)
    }
  } catch (e) {
    console.error(e)
  }
}
console.log(JSON.stringify(gameObj))
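As a possible alternative to querying the page once per field, the same data could be collected in a single page.$$eval call. This is only a sketch: extractGames is a hypothetical name, and the class-based selectors (.name, .homepts, .awaypts, .date) are taken from the HTML in the question, so they are assumptions about the rest of the page.

```javascript
// Hypothetical pageFunction for page.$$eval('div.game.played', extractGames).
// It has to be self-contained, because Puppeteer serializes it and runs it in the page.
function extractGames(gameEls) {
  return gameEls.map((game) => {
    // Return the trimmed text of the first match inside this game box, or null.
    const text = (selector) => {
      const el = game.querySelector(selector);
      return el ? el.textContent.trim() : null;
    };
    return {
      dataCode: game.getAttribute('data-code'),
      homeClub: text('.club:nth-child(1) .name'),
      awayClub: text('.club:nth-child(2) .name'),
      homePoints: text('.homepts'),
      awayPoints: text('.awaypts'),
      gameDate: text('.date'),
    };
  });
}

// Intended use inside an async function with an open Puppeteer `page`:
//   const games = await page.$$eval('div.game.played', extractGames);
//   console.log(JSON.stringify(games));
```

Because each game box is read through its own querySelector calls, the fields cannot drift out of step if one box is missing an element; the missing field simply becomes null.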
Puppeteer has a page.waitFor method that serves the same purpose as your sleep function (it has been deprecated in later Puppeteer versions), but you can also wait for a selector to appear with page.waitForSelector, e.g. page.waitForSelector('div.game.played').
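The question also asks for the results to be stored in a JSON file. A minimal sketch with Node's built-in fs module follows; saveGames and the file name games.json are illustrative choices, and the sample array stands in for the gameObj collected above (its values are copied from the question's HTML).

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Hypothetical helper: write the collected games to a file as pretty-printed JSON.
function saveGames(games, filePath) {
  fs.writeFileSync(filePath, JSON.stringify(games, null, 2), 'utf8');
}

// Sample data shaped like the actualGame objects.
const sample = [
  {
    homeClub: 'Zenit St Petersburg',
    awayClub: 'Zalgiris Kaunas',
    homePoints: '76',
    awayPoints: '75',
    gameDate: 'March 5 18:00 CET',
  },
];

// Written to the temp directory here; any writable path works.
const outFile = path.join(os.tmpdir(), 'games.json');
saveGames(sample, outFile);
console.log('saved', outFile);
```

In the scraper itself this would be called as saveGames(gameObj, 'games.json') once the loop has finished.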
Upvotes: 1