Reputation: 21

Extracting string from html web scrape

I'm looking for some guidance on a web scraping script I'm working on.

All is going well but I'm stuck on stripping out the image file data.

I'm currently doing a WebRequest, getting elements by class, and selecting outerHTML, but I need to extract just the value of the data-imagezoom attribute, as in this example.

Sample data:

<a class="aaImg" href="https://imagehost.ssl.server123.com/Product-800x800/image.jpg">
    <img class="aaTmb" alt="Matrix 900 x 900 test" src="https://imagehost.ssl.server123.com/Product-190x190/image.jpg" item="image"
         data-imagezoom="https://imagehost.ssl.server123.com/Product-1600x1600/image.jpg" data-thumbnail="https://imagehost.ssl.server123.com/Product-190x190/image.jpg">
    </img>
</a>

Current code to get that data:

$ProductInfo = Invoke-WebRequest -Uri $ProductURL
$ProductImageRaw = $ProductInfo.ParsedHTML.body.getElementsByClassName("aaImg") |
                   Select outerHTML

I can obviously get the first image by selecting the href attribute easily.

I was 'dirty coding' by replacing 800x800 with 1600x1600 as the filenames are the same, just a different path, but that came unstuck pretty quick when there were inconsistencies in path names.

Upvotes: 2

Answers (2)

mklement0

Reputation: 440132

You need to access the outer <a> element's <img> child element and call its .getAttribute() method to get the attribute value of interest:

$ProductInfo.ParsedHTML.body.getElementsByClassName("aaImg").
  childnodes[0].getAttribute('data-imagezoom')

.childnodes[0] returns the first child node (here, the <img> element).
.getAttribute('data-imagezoom') returns the value of the data-imagezoom attribute. [1]

This should return the string https://imagehost.ssl.server123.com/Product-1600x1600/image.jpg.
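If a page contains several aaImg links, the same DOM calls can be wrapped in a loop. The following is only a sketch: the function name Get-ZoomImageUrl and its parameters are made up for illustration, and it assumes that the COM element objects expose getElementsByTagName(), as they normally do in Windows PowerShell.

```powershell
# Hypothetical helper (not part of any module): returns the data-imagezoom
# value of the first <img> inside every element that has the given class.
function Get-ZoomImageUrl {
    param(
        $Body,                          # e.g. $ProductInfo.ParsedHtml.body
        [string] $ClassName = 'aaImg'
    )
    foreach ($anchor in $Body.getElementsByClassName($ClassName)) {
        # Look the <img> up by tag name instead of relying on its child index.
        $img = $anchor.getElementsByTagName('img') | Select-Object -First 1
        if ($img) {
            $img.getAttribute('data-imagezoom')
        }
    }
}
```

Assuming $ProductInfo holds the Invoke-WebRequest result from the question, it could be called as Get-ZoomImageUrl -Body $ProductInfo.ParsedHtml.body.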

As for your own answer:

Using regexes (or substring search) to parse structured data such as HTML and XML is brittle and best avoided.
For instance, if the source HTML changes to use '...' instead of "..." around attribute values, your solution breaks (this particular case is not hard to account for in a regex, but there are many more ways in which such parsing can go wrong).
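To make the brittleness concrete, here is a small sketch that runs the regex from the self-answer against three ways of writing the same tag. The single-quoted and reordered variants are invented for illustration; only the URLs come from the question's sample data.

```powershell
# Pattern from the self-answer: expects double quotes and trailing whitespace.
$pattern = '(?<=data-imagezoom=").*?(?="\s)'

$zoom  = 'https://imagehost.ssl.server123.com/Product-1600x1600/image.jpg'
$thumb = 'https://imagehost.ssl.server123.com/Product-190x190/image.jpg'

# The same <img> tag written three ways (the last two are illustrative variants).
$doubleQuoted = '<img data-imagezoom="{0}" data-thumbnail="{1}">' -f $zoom, $thumb
$singleQuoted = "<img data-imagezoom='{0}' data-thumbnail='{1}'>" -f $zoom, $thumb
$lastAttr     = '<img data-thumbnail="{1}" data-imagezoom="{0}">' -f $zoom, $thumb

$m1 = [regex]::Match($doubleQuoted, $pattern)  # matches: quotes and spacing are as expected
$m2 = [regex]::Match($singleQuoted, $pattern)  # no match: single quotes
$m3 = [regex]::Match($lastAttr, $pattern)      # no match: closing quote is followed by '>'

$m1.Value     # the 1600x1600 URL
$m2.Success   # False
$m3.Success   # False
```

All three inputs are equivalent HTML, so a DOM-based lookup returns the same attribute value for each, while the regex only handles the first.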

Cross-platform perspective:

Regrettably, the .ParsedHTML property with its HTML DOM is only available in Windows PowerShell (and its COM implementation is cumbersome and slow to work with in PowerShell).

PowerShell Core, even on Windows, doesn't support it, and there's no in-box HTML parser available (as of PowerShell Core 6.2.0).

The HtmlAgilityPack NuGet package is a popular open-source HTML parser, but it is aimed at C# and therefore nontrivial to install and use in PowerShell.

That said, this answer by TheIncorrigible1 has a working example that downloads the required assembly on demand.
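For orientation, a minimal sketch of what the lookup might look like once the HtmlAgilityPack assembly is loaded. The DLL path and the function name Get-ZoomUrlWithHap are placeholders, not something from the linked answer, and the XPath assumes the markup shown in the question.

```powershell
# Assumption: HtmlAgilityPack.dll has already been obtained; this path is a placeholder.
# Add-Type -Path 'C:\path\to\HtmlAgilityPack.dll'

# Hypothetical helper: returns the data-imagezoom value of every <img>
# that is a direct child of an <a class="aaImg"> element.
function Get-ZoomUrlWithHap {
    param([string] $Html)

    $doc = [HtmlAgilityPack.HtmlDocument]::new()
    $doc.LoadHtml($Html)

    # SelectNodes() returns $null when nothing matches; foreach then runs zero times.
    $nodes = $doc.DocumentNode.SelectNodes("//a[@class='aaImg']/img[@data-imagezoom]")
    foreach ($node in $nodes) {
        # The second argument is the default value if the attribute is missing.
        $node.GetAttributeValue('data-imagezoom', '')
    }
}
```

Because this works on the raw HTML string, it could be fed the response text, e.g. Get-ZoomUrlWithHap -Html (Invoke-WebRequest -Uri $ProductURL).Content, without relying on .ParsedHTML.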

[1] Note that .getAttribute() is necessary to access custom attributes, whereas standard attributes such as id and, in the case of <a> elements, href, are represented directly as object properties (e.g., .id). .getAttribute() works with standard attributes too.

Upvotes: 1

blackcat_au

Reputation: 21

So, after a quick crash course in some Regex, this is what I've come up with.

(?<=data-imagezoom=").*?(?="\s)

A positive lookbehind anchors the match right after data-imagezoom=", and a lazy match then selects everything up to the closing quote followed by whitespace.

Thanks all.

Upvotes: 0
