IMPORTXML function in Google Sheets

Question

Using the IMPORTXML function, is it possible to construct an XPATH query that pulls the Industry value for a given Wikipedia page?

For example, the value I want to pull from this page - https://en.wikipedia.org/wiki/Target_Corporation - is "Retail" whereas on this page - https://en.wikipedia.org/wiki/Boohoo.com - it would be "Fashion".

Tanaike · Accepted Answer

You want an XPath that retrieves the Industry value from a given Wikipedia page.

If my understanding is correct, how about the following formula? Please think of this as just one of several possible answers.

Sample formula:

=IMPORTXML(A1,"//th[text()='Industry']/following-sibling::td")

The XPath is //th[text()='Industry']/following-sibling::td. It selects the td cell that follows the th header whose text is exactly "Industry".
The URL, for example https://en.wikipedia.org/wiki/Target_Corporation or https://en.wikipedia.org/wiki/Boohoo.com, is put in cell "A1".
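
To make the selection concrete, here is a minimal Python sketch of what the XPath matches. It is not how IMPORTXML works internally. The table fragment is hypothetical and heavily simplified, the helper name industry_cells is made up for illustration, and because the standard library's ElementTree does not support the following-sibling axis, the axis is emulated with a loop.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified infobox fragment; real Wikipedia markup is richer.
SAMPLE = """
<table>
  <tr><th>Type</th><td>Public</td></tr>
  <tr><th>Industry</th><td><a href="/wiki/Retail">Retail</a></td></tr>
</table>
"""

def industry_cells(markup):
    """Emulate //th[text()='Industry']/following-sibling::td on a fragment."""
    root = ET.fromstring(markup)
    results = []
    for row in root.iter("tr"):
        children = list(row)
        for i, child in enumerate(children):
            # th[text()='Industry']: a header cell whose own text is "Industry"
            if child.tag == "th" and child.text == "Industry":
                # following-sibling::td: later siblings of that th that are td
                for sibling in children[i + 1:]:
                    if sibling.tag == "td":
                        results.append("".join(sibling.itertext()))
    return results

print(industry_cells(SAMPLE))
```

Only the row whose header reads "Industry" contributes a value; the "Type" row is skipped.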

Reference:

XPath Axes

Added:

From your reply, I understood that you want to add 2 more URLs. So all of the URLs are as follows.

https://en.wikipedia.org/wiki/Target_Corporation
https://en.wikipedia.org/wiki/Boohoo.com
https://en.wikipedia.org/wiki/Woot
https://en.wikipedia.org/wiki/TripAdvisor

Issue and workaround:

For the above URLs, the formula =IMPORTXML(A1,"//th[text()='Industry']/following-sibling::td") returns Retail, Fashion, Retail and "Travel, services" respectively. For the TripAdvisor page, Travel and services come back as 2 separate values.

When the XPath is modified to //th[text()='Industry']/following-sibling::td/a, Retail, #N/A, #N/A and Travel are returned.

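The reason is the following difference in how the Industry cell is marked up on each page. Judging from the results above, the value is a link on some pages and plain text on others, and on the TripAdvisor page "Travel" is a link followed by the plain text "services".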


  Industry
  Travel services

and


  Industry
  Retail

and


  Industry
  Fashion

Because of this, I think that unfortunately Travel, Retail and Fashion cannot all be retrieved directly with a single XPath. So I used a built-in function for this situation.

Workaround:

In this workaround, I used INDEX. Please think of this as just one of several possible answers.

=INDEX(IMPORTXML(A1,"//th[text()='Industry']/following-sibling::td"),1,1)

The XPath is still //th[text()='Industry']/following-sibling::td; it is not modified.
The URL is put in cell "A1".
When 2 values are retrieved, INDEX returns only the 1st one. This is why I used INDEX.
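
The following Python sketch illustrates why td/a fails on some pages and why taking the first value works on all of them. The three cells are hypothetical stand-ins for the three cases, not markup copied from Wikipedia, and the helper names are made up. Treating each text node as one returned value is an assumption inferred from the results above, not documented IMPORTXML behavior.

```python
import xml.etree.ElementTree as ET

# Hypothetical cells modeled on the three cases; not copied from Wikipedia.
CELLS = {
    "link only": '<td><a href="/wiki/Retail">Retail</a></td>',
    "plain text": "<td>Fashion</td>",
    "link plus text": '<td><a href="/wiki/Travel">Travel</a> services</td>',
}

def text_nodes(td_markup):
    """Non-empty text nodes inside the td, in document order."""
    td = ET.fromstring(td_markup)
    return [t for t in td.itertext() if t.strip()]

def link_texts(td_markup):
    """Text of the a elements directly under the td (the td/a case)."""
    td = ET.fromstring(td_markup)
    return [a.text for a in td.findall("a")]

def first_value(td_markup):
    """Analogue of INDEX(..., 1, 1): keep only the first value, if any."""
    nodes = text_nodes(td_markup)
    return nodes[0].strip() if nodes else None

for label, cell in CELLS.items():
    print(label, text_nodes(cell), link_texts(cell), first_value(cell))
```

The plain-text cell has no a element, which corresponds to the #N/A results, while the first text node is the wanted value in all three cases.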
