Ririx

Reputation: 25

Web scraping an element without an id in VBA

I'm trying to scrape a web page. Some elements were easy to get, but I have a problem with those that have no id, like this one:

<TABLE class=DisplayMain1 cellSpacing=1 cellPadding=0><TBODY> <TR class=TitleLabelBig1> <TD class=Title1 colSpan=100><SPAN style="FONT-FAMILY: arial narrow; FONT-WEIGHT: normal">Tool &amp; </SPAN><BR>PE311934-1-1 </TD></TR></TBODY></TABLE>

I want this value: PE311934-1-1

I tried "document.getElementsByClassName", but VBA gave me an error.

Any tips?

Upvotes: 1

Views: 1644

Answers (2)

QHarr

Reputation: 84465

You don't specify the error and there is not enough HTML to know how many elements there are on the page.

You may have forgotten to use an index with document.getElementsByClassName("Title1"), as it returns a collection rather than a single element.

For example, the first item would be: document.getElementsByClassName("Title1")(0)
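As a rough sketch of the indexing point, the snippet from the question can be loaded into an in-memory document and the first match read by index. This assumes early binding with a reference to the Microsoft HTML Object Library; the procedure name FirstTitleByClassName is made up for illustration, and on a live page you would use the browser's document instead of the hard-coded string.

```vba
' Requires a reference to "Microsoft HTML Object Library" (Tools > References).
Public Sub FirstTitleByClassName()
    Dim html As MSHTML.HTMLDocument
    Dim cell As Object

    ' Assumption: the question's HTML is pasted in as a string for testing.
    Set html = New MSHTML.HTMLDocument
    html.body.innerHTML = "<table class=""DisplayMain1""><tbody><tr class=""TitleLabelBig1"">" & _
        "<td class=""Title1"" colspan=""100""><span>Tool &amp; </span><br>PE311934-1-1</td>" & _
        "</tr></tbody></table>"

    ' getElementsByClassName returns a collection, so pick an item by index.
    Set cell = html.getElementsByClassName("Title1")(0)

    ' Prints the cell text, which still includes the "Tool & " prefix.
    Debug.Print cell.innerText
End Sub
```

Leaving off the (0) and treating the result as a single element is a likely cause of the error, though the question does not say what the error was.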


In the same way, you could use a CSS selector such as .Title1 with querySelector.

It says the same thing, i.e. select the elements with the class name "Title1".

For the first instance simply use:

document.querySelector(".Title1")

For a NodeList of all matching elements, use:

document.querySelectorAll(".Title1")

and then loop from 0 to its length minus 1, reading each item by index.
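A minimal sketch of both selector calls, again assuming early binding to the Microsoft HTML Object Library and a hard-coded copy of the question's HTML (the procedure name ListAllTitleCells is hypothetical). Whether querySelector and querySelectorAll are available depends on the MSHTML version installed, so treat this as something to try rather than a guarantee.

```vba
' Requires a reference to "Microsoft HTML Object Library" (Tools > References).
Public Sub ListAllTitleCells()
    Dim html As MSHTML.HTMLDocument
    Dim nodes As Object
    Dim i As Long

    ' Assumption: the question's HTML is pasted in as a string for testing.
    Set html = New MSHTML.HTMLDocument
    html.body.innerHTML = "<table class=""DisplayMain1""><tbody><tr class=""TitleLabelBig1"">" & _
        "<td class=""Title1"" colspan=""100""><span>Tool &amp; </span><br>PE311934-1-1</td>" & _
        "</tr></tbody></table>"

    ' First matching element only.
    Debug.Print html.querySelector(".Title1").innerText

    ' All matching elements: loop by index from 0 to Length - 1.
    Set nodes = html.querySelectorAll(".Title1")
    For i = 0 To nodes.Length - 1
        Debug.Print i, nodes.Item(i).innerText
    Next i
End Sub
```

The loop uses an index with .Item(i) instead of For Each, which tends to be the more reliable way to walk this kind of list from VBA.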


You would access the .innerText property of the element, generally, to retrieve the required string.


For the snippet shown, assuming the item is the first .Title1 on the page, the CSS selector retrieves the following from your HTML:

[Image: CSS query result]

The resulting string can then be processed to get what you want. Both this method and regex are fragile, since an update to the source page can easily break them.

In your example above, you can select by the class name, .Title1, and then use Replace() to remove the "Tool & " prefix.
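The Replace() step might look like the sketch below. The function name StripToolPrefix is made up for illustration, and it assumes the cell text is the "Tool &" prefix, possibly a line break, and then the value you want.

```vba
Public Function StripToolPrefix(ByVal cellText As String) As String
    Dim cleaned As String

    ' Remove the fixed prefix taken from the question's HTML.
    cleaned = Replace(cellText, "Tool &", vbNullString)

    ' The <BR> may show up in innerText as a line break, so drop CR and LF too.
    cleaned = Replace(cleaned, vbCr, vbNullString)
    cleaned = Replace(cleaned, vbLf, vbNullString)

    ' Trim leftover leading and trailing spaces.
    StripToolPrefix = Trim$(cleaned)
End Function
```

For example, Debug.Print StripToolPrefix("Tool & " & vbCrLf & "PE311934-1-1") prints PE311934-1-1.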

Upvotes: 1

AnalystCave.com

Reputation: 4974

Use Regular Expressions and the XMLHttpRequest object in VBA

I made an add-in some time ago that does just that:

http://www.analystcave.com/excel-tools/excel-scrape-html-add/

If you just want the source code, it is here (see the GetElementByRegex function):

http://www.analystcave.com/excel-scrape-html-element-id/

Now the actual regex will be quite simple:

</SPAN><BR>(.*?)</TD></TR></TBODY></TABLE>

If it captures too many items, simply expand the regex with more of the surrounding markup.
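For illustration only, here is a small late-bound sketch that applies that pattern to a string of HTML with the VBScript.RegExp object. It is not the GetElementByRegex function from the linked page; the name ExtractAfterBr is hypothetical, and htmlSource is assumed to be page source you have already downloaded (for example the responseText of an XMLHttpRequest).

```vba
Public Function ExtractAfterBr(ByVal htmlSource As String) As String
    Dim re As Object
    Dim matches As Object

    Set re = CreateObject("VBScript.RegExp")
    re.Pattern = "</SPAN><BR>(.*?)</TD></TR></TBODY></TABLE>"
    re.IgnoreCase = True   ' raw page source may use lower-case tags
    re.Global = False      ' only the first match is needed

    Set matches = re.Execute(htmlSource)
    If matches.Count > 0 Then
        ' SubMatches(0) is the text captured by (.*?).
        ExtractAfterBr = Trim$(matches(0).SubMatches(0))
    End If
End Function
```

Run against the HTML in the question, this returns PE311934-1-1. The function returns an empty string when nothing matches, which can happen if the server's raw source has different whitespace or tag layout than the snippet shown.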

Upvotes: 2
