Data_is_Power

Reputation: 785

How to parse a table from Wikipedia using the htmltab package?

I am trying to parse one table located at https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population#Sovereign_states_and_dependencies_by_population, and I would like to use the htmltab package for this. My current code is shown below, followed by the error it produces. I also tried passing "Rank" and "% of world population" in the which argument, but still received an error. I am not sure what could be wrong.

Please note: I am new to R and web scraping, so an explanation of the code would be a great help.

library(htmltab)

url3 <- "https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population#Sovereign_states_and_dependencies_by_population"
list_of_countries <- htmltab(doc = url3, which = "//th[text() = 'Country(or dependent territory)']/ancestor::table")

Error: Couldn't find the table. Try passing (a different) information to the which argument.

Upvotes: 1

Views: 265

Answers (1)

mathematical.coffee

Reputation: 56915

This is an XPath problem, not an R problem. If you inspect the HTML of that table, the relevant header is:

<th class="headerSort" tabindex="0" role="columnheader button" title="Sort ascending">
  Country<br><small>(or dependent territory)</small>
</th>

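text() only selects the text nodes that are direct children of the <th>, so here it is just "Country". The "(or dependent territory)" part sits inside the nested <small> element, which is why your original selector never matches.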

For example, the following could work. It is not the only option; you may have to try a few XPath selectors to see which one matches.

htmltab(doc = url3, which = "//th[text() = 'Country']/ancestor::table")

Alternatively, since it's the first table on the page, you could try which = 1 instead.

(NB: in Chrome you can run $x("//th[text() = 'Country']") and similar expressions in the developer console to try these out; other browsers no doubt offer something similar.)
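
If you would rather experiment from R, here is a minimal offline sketch of the same idea. It assumes the XML package is installed (as far as I know, htmltab relies on it for parsing). The HTML string is a hand-written stand-in for the Wikipedia header, not the live page, and count_matches is just a helper name made up for this illustration.

```r
library(XML)

# Hand-written stand-in for the relevant part of the table (assumption:
# the real page has the same <th>/<br>/<small> nesting as shown above).
html <- '<table><tr>
  <th>Rank</th>
  <th>Country<br><small>(or dependent territory)</small></th>
</tr><tr><td>1</td><td>China</td></tr></table>'
doc <- htmlParse(html, asText = TRUE)

# Hypothetical helper: how many nodes does an XPath expression select?
count_matches <- function(xpath) length(getNodeSet(doc, xpath))

# The selector from the question: compares against text that spans two elements.
n_full <- count_matches("//th[text() = 'Country(or dependent territory)']")

# Only the direct text node of the <th> is compared.
n_short <- count_matches("//th[text() = 'Country']")

# A looser variant: "." is the full string value of the <th>, including <small>.
n_loose <- count_matches("//th[starts-with(normalize-space(.), 'Country')]")

print(c(full = n_full, short = n_short, loose = n_loose))
```

The first selector finds nothing, while the other two each find the header cell. On the real page the text node may carry extra whitespace, so a normalize-space() or starts-with() variant may turn out to be the more forgiving choice to pass to which.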

Upvotes: 1
