Shannon Carter
Shannon Carter

Reputation: 11

Using Google Sheet Formula IMPORTXML to extract hyperlinks from table on web page, and flag when an image is in seperate column

I'm doing some analysis which requires me to save table data and (hyperlinked) links to lots of PDF's from a webpage (https://www.asx.com.au/asx/v2/statistics/prevBusDayAnns.do).

I've been playing around with the =IMPORTHTML and =IMPORTXML formulas in Google Sheets and have managed to extract the table data using =IMPORTHTML(A1,"table",1), but I'm struggling to extract the "Price sens." column which contains images or the hyperlinks attached to the "Headline" items. I'm having no luck with IMPORTXML so far, and can't seem to find any solutions online.

enter image description here

enter image description here

Upvotes: 1

Views: 506

Answers (1)

Cornelius Roemer
Cornelius Roemer

Reputation: 8346

The formula for IMPORTXML you're looking for is:

=IMPORTXML("https://www.asx.com.au/asx/v2/statistics/prevBusDayAnns.do","//*[@id='content']/div/announcement_data/table/tbody/tr")

You need to provide an XPATH, which you can get by clicking on an element in the browser dev tools and selecting copy > XPATH.

Unfortunately, while this does produce output, it's just the same as for IMPORTHTML. The price sensitivity column is always empty, too.

The reason for this is, that the content of the price sensitivity columns is not text, but an image, as you can see in your screenshots.

So it looks like you need some more powerful HTML parsing tools here than Google Sheets provides. It would be easy to look for img tags if you parsed the website using Python and Beautifulsoup, for instance. So you may want to go down this route.

Here's what I got using IMPORTXML, same as you:

enter image description here

The problem is that price sensitivity is img not text:

enter image description here

Upvotes: 1

Related Questions