Reputation: 627
I am trying to scrape a music website with Puppeteer. I want to scrape the audio "src" from the website, but the website assigns the src dynamically when the user plays a track. So I have a script that plays each track, and then I want to grab the "src" from the audio tag. But I get the error "page is not defined". I think the Puppeteer page is not defined inside callback functions, so I need your help with it.
import puppeteer from 'puppeteer-core';
import appendJSONdata from './utils/appendJSONdata.js';
export function scrape() {
  try {
    (async () => {
      // set some options (set headless to false so we can see this automated browsing experience)
      let launchOptions = {
        headless: true,
        executablePath:
          'C:/Program Files (x86)/Google/Chrome/Application/chrome.exe', // because we are using puppeteer-core so we must define this option
        args: ['--start-maximized'],
      };
      const browser = await puppeteer.launch(launchOptions);
      const page = await browser.newPage();
      // set viewport and user agent (just in case for nice viewing)
      await page.setViewport({ width: 1366, height: 768 });
      await page.setUserAgent(
        'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36'
      );
      // Go to the chillHop Albums Page
      await page.goto('https://chillhop.com/releases/');
      const albumLinks = await page.$$eval('.release > a', (list) =>
        list.map((elm) => elm.href)
      ); // 12 albums load initially
      // console.log(albumLinks);
      let audioRef = await page.$('audio')
      // console.log();
      //getAudioSrc(111)
      for (const albumURL of albumLinks) {
        // console.log(albumURL);
        await page.goto(albumURL);
        // async function getAudioSrc() {
        //   return await page.$('audio').getAttribute('src')
        // }
        let numOfTracks = await page.$$eval('.track-single', (tracks) => {
          // console.log(page);
          // if (tracks.length >= 5) {
          return tracks.map(track => {
            track.querySelector(`a.track-${track.children[0].getAttribute("data-track")}`).click() // Plays the track
            return {
              "data-track": track.children[0].getAttribute("data-track"),
              "title": track.querySelector("div.trackTitle").textContent,
              "artists": track.querySelectorAll("div.trackArtists")[0].textContent,
              "duration": track.querySelector("div.track-length").textContent,
              "audio-src": page.querySelector('audio').getAttribute('src') // ! page is not defined
            }
            // let dataTrack = track.children[0].getAttribute("data-track")
          })
          // } else {
          //   return "Less than 5 tracks"
          // }
        });
        console.log(numOfTracks);
        // (numOfTracks > 5) ? (scrape the site) : (do not scrape)
      }
      // appendJSONdata("This is random data")
      // close the browser
      await browser.close();
    })();
  } catch (error) {
    console.log(error);
  }
}
Upvotes: 0
Views: 715
Reputation: 1921
Effectively, you are seeing the error "page is not defined" because you are trying to reference an outside variable, page, from inside the Puppeteer evaluate script (in this case, $$eval()).
Note that this evaluate script runs inside the browser context, not in your Node app, so it has no access to any variable defined outside of it unless you explicitly pass it in as an extra argument.
let numOfTracks = await page.$$eval('.track-single', (tracks) => {
  // ...
  // the line below throws an error because 'page' is a variable that is not present in the browser context and hasn't been passed in as an argument either.
  "audio-src": page.querySelector('audio').getAttribute('src') // ! page is not defined
  // ...
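To make the "browser context" point concrete, here is a rough simulation in plain Node, with no browser involved. It assumes only that the callback is shipped as source text and that extra arguments are serialized, which is roughly what happens with evaluate-style calls. The helper simulateBrowserEvaluate and the function demo are hypothetical names for illustration, not part of Puppeteer.

```javascript
// Hypothetical stand-in for an evaluate-style call: the callback is rebuilt
// from its source text, so whatever it closed over in Node is lost.
function simulateBrowserEvaluate(fn, ...args) {
  const source = fn.toString();
  // Rebuild the function in the global scope, away from the caller's closure.
  const rebuilt = new Function(`return (${source});`)();
  // Extra arguments survive only as serialized copies.
  const serializedArgs = JSON.parse(JSON.stringify(args));
  return rebuilt(...serializedArgs);
}

function demo() {
  const page = { title: 'node-side object' }; // exists only on the Node side

  let closureErrorName = null;
  try {
    // Relies on the closure variable `page`, which the rebuilt function cannot see.
    simulateBrowserEvaluate(() => page.title);
  } catch (err) {
    closureErrorName = err.name;
  }

  // Passing the value explicitly as an extra argument works.
  const passed = simulateBrowserEvaluate((title) => title.toUpperCase(), page.title);

  return { closureErrorName, passed };
}

console.log(demo());
```

The first call fails for the same reason as the snippet above, while the second succeeds because the value travels as an argument rather than through the closure.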
In your specific case, it looks like you just want to access the document object, which is available in the browser context, not the Puppeteer Page object (page).
So a possible solution could be the following.
let numOfTracks = await page.$$eval('.track-single', (tracks) => {
  // console.log(page);
  // if (tracks.length >= 5) {
  return tracks.map(track => {
    track.querySelector(`a.track-${track.children[0].getAttribute("data-track")}`).click() // Plays the track
    return {
      "data-track": track.children[0].getAttribute("data-track"),
      "title": track.querySelector("div.trackTitle").textContent,
      "artists": track.querySelectorAll("div.trackArtists")[0].textContent,
      "duration": track.querySelector("div.track-length").textContent,
      "audio-src": document.querySelector('audio').getAttribute('src')
    }
    // let dataTrack = track.children[0].getAttribute("data-track")
  })
  // } else {
  //   return "Less than 5 tracks"
  // }
});
Another option could be extracting the value audioSrc first, assuming it's some global value you are trying to get, and then passing it as an extra argument to $$eval().
Please refer to the Puppeteer docs on page.$$eval() for more details.
const audioSrc = await page.evaluate(() => document.querySelector('audio').getAttribute('src'))
let numOfTracks = await page.$$eval('.track-single', (tracks, audioSrc) => {
  return tracks.map(track => {
    track.querySelector(`a.track-${track.children[0].getAttribute("data-track")}`).click() // Plays the track
    return {
      // ...
      "audio-src": audioSrc // <-- the value passed in as an argument below
    }
  })
}, audioSrc); // <-- notice the extra arg passed to the $$eval() script
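One caveat, given that the site assigns src only when a track is played: both snippets read the audio src at a single moment, so every track may end up with the same (or an empty) value. A possible variation is to play the tracks one at a time and wait for the src to change before reading it. The sketch below assumes the selectors from the question, assumes the site updates the same audio element after each click, and uses a hypothetical helper name, collectAudioSources; the timeout is an arbitrary choice.

```javascript
// Runs in the browser: returns the current src of the <audio> tag, or null.
const readAudioSrc = () => {
  const audio = document.querySelector('audio');
  return audio ? audio.getAttribute('src') : null;
};

// Hypothetical helper: plays each track in turn and records the src it produces.
// `page` is a Puppeteer Page already navigated to an album page.
async function collectAudioSources(page, timeoutMs = 10000) {
  const trackHandles = await page.$$('.track-single');
  const results = [];
  let previousSrc = await page.evaluate(readAudioSrc);

  for (const trackHandle of trackHandles) {
    // Click the track and collect its metadata inside the browser context.
    const info = await trackHandle.evaluate((track) => {
      const dataTrack = track.children[0].getAttribute('data-track');
      track.querySelector(`a.track-${dataTrack}`).click(); // plays the track
      return {
        'data-track': dataTrack,
        title: track.querySelector('div.trackTitle').textContent,
        artists: track.querySelector('div.trackArtists').textContent,
        duration: track.querySelector('div.track-length').textContent,
      };
    });

    // Wait until the src differs from the previous one. Note: this times out
    // if two consecutive tracks happen to share the same src.
    await page.waitForFunction(
      (oldSrc) => {
        const audio = document.querySelector('audio');
        const src = audio && audio.getAttribute('src');
        return Boolean(src) && src !== oldSrc;
      },
      { timeout: timeoutMs },
      previousSrc
    );

    previousSrc = await page.evaluate(readAudioSrc);
    results.push({ ...info, 'audio-src': previousSrc });
  }
  return results;
}
```

Inside the album loop this would be called as `const tracks = await collectAudioSources(page);` in place of the single $$eval() call.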
Honestly, now that the original issue is understood, you are the one best placed to pick the right option, based on what you know about the site itself and on your current goals and requirements.
Upvotes: 1