Reputation: 2433
I am trying to write a Playwright JS test to scrape some values from a website.
Here is the HTML of the page I am trying to scrape:
<div class="pl-it">
<div class="i_d">
<dl>
<dt class>Series:
<dd>
<a href=".....">Province A</a>
</dd>
</dt>
<dt class>Catalog Codes:
<dd>
<strong>Mi</strong>
<strong>CA 1x,</strong>
"ca 17"
</dd>
</dt>
<dt class>Variants:
<dd><strong><a>Click to see variants</a></strong></dd>
</dt>
</dl>
</div>
<div class="i_d">
<dl>
<dt class>Series:
<dd>
<a href=".....">Province B</a>
</dd>
</dt>
<dt class>Catalog Codes:
<dd>
<strong>Fu</strong>
<strong>DE 2x,</strong>
"pa 21"
</dd>
</dt>
<dt class>Variants:
<dd><strong><a>Click to see variants</a></strong></dd>
</dt>
</dl>
</div>
</div>
As you can see, there are multiple divs that have class i_d, and each of them contains a dl tag.
Inside each dl tag, there are several pairs of dt & dd tags.
Basically, what I am trying to do is log each dt value & its corresponding dd value to the console.
The final outcome should look something like this in the logs:
Series: Province A
Catalog Codes: Mi CA 1x, ca 17
Variants: Click to see variants
Series: Province B
Catalog Codes: Fu DE 2x, pa 21
Variants: Click to see variants
Below is my current output:
[
{
label: 'Series:',
name: 'Province A'
},
{
label: 'Series:',
name: 'Province B'
},
]
As you can see, it is only printing out the first dt & dd values in each group, not the remaining ones (i.e. Catalog Codes, etc.)
Here is my current Playwright JS code:
const { test, expect } = require('@playwright/test');

test('homepage has Playwright in title and get started link linking to the intro page', async ({ page }) => {
  await page.goto('https://colnect.com/en/stamps/list/country/38-Canada');
  await expect(page.locator('div#pageContent h1')).toContainText('Stamp catalog › Canada › Stamps');
  const books = await page.$$eval('div.i_d', all_items => {
    const data = [];
    all_items.forEach(book => {
      const label = book.querySelector('dt')?.innerText;
      const name = book.querySelector('dd')?.innerText;
      data.push({ label, name });
    });
    return data;
  });
  console.log(books);
});
Can someone please tell me how I can access each dt & dd rather than just the first one in each group?
Upvotes: 1
Views: 3770
Reputation: 2509
The DOM is a little complicated, so I had to use nested loops to get the format you are looking for.
const { test, expect } = require('@playwright/test');

test.describe('Scrap', () => {
  test('Stamps', async ({ page }) => {
    await page.goto('https://colnect.com/en/stamps/list/country/38-Canada');
    await page.waitForLoadState('networkidle');
    await expect(page.locator('div#pageContent h1')).toContainText('Stamp catalog › Canada › Stamps');
    // Runs in the browser: build one { label: value } object per stamp container.
    const scrappedStampData = await page.$$eval('div.i_d', (stamps) => {
      const stampsArray = [];
      // Outer loop walks the stamps; inner loop walks every dt inside a stamp.
      stamps.forEach((stamp) => {
        const stampObject = {};
        stamp.querySelectorAll('dt').forEach((row) => {
          const rowLabel = row.innerText;
          // The matching dd is the element right after its dt.
          const rowValue = row.nextElementSibling.innerText;
          stampObject[rowLabel] = rowValue;
        });
        stampsArray.push(stampObject);
      });
      return stampsArray;
    });
    // Back in Node: print the label/value pairs of each stamp.
    scrappedStampData.forEach((stampData, ind) => {
      console.log(`\n**************Stamp: ${ind + 1}*****************\n`);
      for (const key in stampData) {
        console.log(key + ' ' + stampData[key]);
      }
    });
  });
});
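The reason the original attempt only returned the first pair is that querySelector stops at the first matching element, while querySelectorAll returns all of them. As a rough sketch of the same nested-loop idea, the extraction can also live in a standalone function. The name extractStampRows is hypothetical, and the sketch assumes each dd is the element directly after its dt, as in the code above. It uses textContent and collapses whitespace so that multi-part values come out on one line.

```javascript
// Hypothetical helper, not part of Playwright. It is deliberately
// self-contained: the function handed to $$eval is serialized and run in
// the browser, so it cannot rely on variables defined outside of it.
function extractStampRows(containers) {
  // Collapse runs of spaces and newlines into single spaces.
  const clean = (text) => (text || '').replace(/\s+/g, ' ').trim();
  return Array.from(containers, (container) => {
    const rows = [];
    // querySelectorAll yields every dt; querySelector would yield only the first.
    container.querySelectorAll('dt').forEach((dt) => {
      const dd = dt.nextElementSibling;
      // Assumption: the value is the dd right after the dt; skip anything else.
      if (!dd || dd.tagName !== 'DD') return;
      rows.push({ label: clean(dt.textContent), value: clean(dd.textContent) });
    });
    return rows;
  });
}
```

Inside a test it could be passed straight to the evaluation call, for example const stamps = await page.$$eval('div.i_d', extractStampRows); which should give one array of { label, value } rows per stamp.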
Output of the Stamps test:
**************Stamp: 1*****************
Series: Province of Canada Pence Issue (imperforate)
Catalog codes: Mi:CA 1x, Sn:CA 8, Yt:CA 4, Sg:CA 17
Variants: Click to see variants
Themes: Crowns and Coronets | Famous People | Heads of State | Queens | Royalty | Women
Issued on: 1857-08-01
Colors: Rose
Printers: Rawdon, Wright, Hatch & Edson
Format: Stamp
Emission: Definitive
Perforation: Imperforate
Printing: Recess
Paper: machine-made medium to thick wove
Face value: ½ d - Canadian penny
Print run: 2,600,000
Score: 95% Accuracy: High
Buy Now: Find similar items on eBay
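If you would rather have the flat "label value" lines from the question without the banner per stamp, something along these lines might work. This is only a sketch: formatStamps is a hypothetical helper, and the sample data is made up from the HTML snippet in the question rather than scraped from the site. It expects the same shape as scrappedStampData, i.e. one object per stamp with the labels as keys.

```javascript
// Hypothetical helper: turns [{ 'Label:': 'value', ... }, ...] into an
// array of "Label: value" lines, one line per dt/dd pair.
function formatStamps(stamps) {
  // Collapse runs of spaces and newlines so each pair fits on one line.
  const clean = (text) => String(text).replace(/\s+/g, ' ').trim();
  return stamps.flatMap((stamp) =>
    Object.entries(stamp).map(([label, value]) => `${clean(label)} ${clean(value)}`)
  );
}

// Sample data based on the HTML in the question (assumed, not scraped).
const sample = [
  { 'Series:': 'Province A', 'Catalog Codes:': 'Mi CA 1x,\nca 17' },
  { 'Series:': 'Province B', 'Catalog Codes:': 'Fu DE 2x,\npa 21' },
];

console.log(formatStamps(sample).join('\n'));
// Series: Province A
// Catalog Codes: Mi CA 1x, ca 17
// Series: Province B
// Catalog Codes: Fu DE 2x, pa 21
```

In the test above, console.log(formatStamps(scrappedStampData).join('\n')) would take the place of the final forEach loop.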
Upvotes: 1