zgall1
zgall1

Reputation: 3025

IMPORTXML function in Google Sheets

Using the IMPORTXML function, is it possible to construct an XPATH query that pulls the Industry value for a given Wikipedia page?

For example, the value I want to pull from this page - https://en.wikipedia.org/wiki/Target_Corporation - is "Retail" whereas on this page - https://en.wikipedia.org/wiki/Boohoo.com - it would be "Fashion".

Upvotes: 2

Views: 338

Answers (2)

Tanaike
Tanaike

Reputation: 201348

  • You want to create the xpath for retrieving the Industry value for a given Wikipedia page.

If my understanding is correct, as other pattern, how about the formula with this xpath? Please think of this as just one of several answers.

Sample formula:

=IMPORTXML(A1,"//th[text()='Industry']/following-sibling::td")
  • The xpath is //th[text()='Industry']/following-sibling::td.
  • In this case, the URL of https://en.wikipedia.org/wiki/Target_Corporation or https://en.wikipedia.org/wiki/Boohoo.com is put in the cell "A1".

Result:

enter image description here

Reference:

Added:

From your replying, I knew that you want to add 2 more URLs. So all URLs are as follows.

Issue and workaround:

For above URLs, when the formula of =IMPORTXML(A1,"//th[text()='Industry']/following-sibling::td") is used, Retail, Fashion, Retail and Travel, services are returned.

When the xpath is modified to //th[text()='Industry']/following-sibling::td/a, Retail, #N/A, #N/A and Travel are returned.

The reason of this is due to the following difference.

<tr>
  <th scope="row">Industry</th>
  <td class="category"><a href="/wiki/Travel" title="Travel">Travel</a> services</td>
</tr>

and

<tr>
  <th scope="row" style="padding-right:0.5em;">Industry</th>
  <td class="category" style="line-height:1.35em;"><a href="/wiki/Retail" title="Retail">Retail</a></td>
</tr>

and

<tr>
  <th scope="row" style="padding-right:0.5em;">Industry</th>
  <td class="category" style="line-height:1.35em;">Fashion</td>
</tr>

By this, I think that unfortunately, in order to retrieve Travel, Retail and Fashion from above, those cannot be directly retrieved with only one xpath. So I used a built-in function for this situation.

Workaround:

In this workaround, I used INDEX. Please think of this as just one of several answers.

=INDEX(IMPORTXML(A1,"//th[text()='Industry']/following-sibling::td"),1,1)
  • The xpath is //th[text()='Industry']/following-sibling::td. This is not modified.
  • In this case, the URL is put in the cell "A1".
  • When 2 values are retrieved, the 1st one is retrieved. By this, I used INDEX.
Result:

enter image description here

Upvotes: 2

player0
player0

Reputation: 1

try:

=INDEX(IMPORTXML("https://en.wikipedia.org/wiki/Boohoo.com", 
 "//td[@class='category']"), 2, 1)

=INDEX(IMPORTXML("https://en.wikipedia.org/wiki/Target_Corporation", 
 "//td[@class='category']"),2,1)

0

Upvotes: 0

Related Questions