There is a table of taxes by country at the link below that I would like to scrape into a dataframe with Country and Tax columns.
I've tried using the rvest package as follows to get my Country column but the list I generate is empty and I don't understand why.
I would appreciate any pointers on resolving this problem.
library(rvest)
d1 <- read_html(
  "http://taxsummaries.pwc.com/ID/Corporate-income-tax-(CIT)-rates"
)
TaxCountry <- d1 %>%
  html_nodes('.countryNameQC') %>%
  html_text()
The data is loaded dynamically, and the DOM is altered when JavaScript runs in the browser. This doesn't happen with rvest, which only retrieves the static HTML.
The following selectors, in the browser, would have isolated your nodes:
.twoCountryWrapper .countryNameAndYearQC:nth-child(1) .countryNameQC
.twoCountryWrapper .countryNameAndYearQC:nth-child(1) .countryYear
.twoCountryWrapper .countryNameAndYearQC:nth-child(2) .countryNameQC
.twoCountryWrapper .countryNameAndYearQC:nth-child(2) .countryYear
However, those classes are not even present in what rvest returns.
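One quick way to confirm this kind of problem is to check whether the class name appears in any class attribute of the raw HTML that a non-browser client receives. A minimal Python sketch, assuming a made-up static response (the `raw_html` string and the `has_class` helper are hypothetical, not the real page or part of any library):

```python
import re

# Hypothetical static HTML as a client without JavaScript might receive it;
# the real page content will differ.
raw_html = """
<div id="dspQCLinks1">CountryA@/link-a@10!,</div>
<script>/* builds .countryNameQC nodes at runtime */</script>
"""

def has_class(html: str, class_name: str) -> bool:
    """Return True if any class attribute in the markup contains class_name."""
    pattern = r'class\s*=\s*["\']([^"\']*)["\']'
    return any(class_name in value.split() for value in re.findall(pattern, html))

# The class is only mentioned inside the script, never used as a class attribute.
print(has_class(raw_html, "countryNameQC"))
```

If a check like this comes back negative for the real response, a selector based on that class will match nothing, however correct it is in the browser.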
The data of interest is actually stored in several nodes, all of which have ids with a common prefix of dspQCLinks. The data inside each node is a delimited string.
So, you can gather all those nodes using a CSS attribute selector with the starts-with operator (^=):
html_nodes(page, "[id^=dspQCLinks]")
Then extract the text and combine it into one string:
paste(html_text(html_nodes(page, "[id^=dspQCLinks]")), collapse = '')
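The same prefix-based selection and concatenation can be sketched in Python using only the standard library. This is an illustration under assumptions, not the method used in this answer: the `PrefixTextCollector` class and the `sample_html` snippet are hypothetical, and the sample data is invented.

```python
from html.parser import HTMLParser

# Elements that never get a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}

class PrefixTextCollector(HTMLParser):
    """Collect text inside elements whose id starts with a given prefix."""

    def __init__(self, prefix: str):
        super().__init__()
        self.prefix = prefix
        self.depth = 0    # greater than 0 while inside a matching element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        element_id = dict(attrs).get("id") or ""
        if self.depth or element_id.startswith(self.prefix):
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)

# Hypothetical markup standing in for the real page.
sample_html = (
    '<div id="dspQCLinks1">CountryA@/link-a@10!,</div>'
    '<div id="other">ignored</div>'
    '<div id="dspQCLinks2">CountryB@/link-b@20!,</div>'
)

collector = PrefixTextCollector("dspQCLinks")
collector.feed(sample_html)
combined = "".join(collector.chunks)
print(combined)
```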
Now each row in your table is delimited by "!,", so we can split on that to generate the rows:
info = strsplit(paste(html_text(html_nodes(page, "[id^=dspQCLinks]")), collapse = ''),"!,")[[1]]
An example row would then look like:
"Albania@/uk/taxsummaries/wwts.nsf/ID/Albania-Corporate-Taxes-on-corporate-income@15"
If we split each row on the @, the data we want is at indices 1 and 3 (here i is a single row from info):
arr = strsplit(i, '@')[[1]]
country <- arr[1]
tax <- arr[3]
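For comparison, here is the same field extraction as a small Python sketch, using the example row shown above. Python indexes from 0, so R's positions 1 and 3 become 0 and 2.

```python
# Example row taken from the answer text above.
row = "Albania@/uk/taxsummaries/wwts.nsf/ID/Albania-Corporate-Taxes-on-corporate-income@15"

fields = row.split("@")               # [country, link, tax]
country, tax = fields[0], fields[2]   # R indices 1 and 3
print(country, tax)
```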
Thanks to @Brian's feedback, I have removed the loop I had used to build the dataframe and replaced it with, to quote @Brian,
str_split_fixed(info, "@", 3)
[which] gives you a character matrix, which can be directly coerced to a dataframe.
df <- data.frame(str_split_fixed(info, "@", 3))
You then remove the empty rows at the bottom of the df (this assumes the columns have been named, as in the full code below):
df <- df[df$Country != "",]
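To make the fixed-width split and the empty-row cleanup concrete, here is a rough Python analogue. It is a sketch under assumptions: `split_fixed` is a hypothetical helper that mimics what `str_split_fixed(info, "@", 3)` does (exactly three pieces, padded with empty strings), and the `combined` string is invented sample data.

```python
def split_fixed(text: str, sep: str, n: int) -> list:
    """Split text into exactly n pieces, padding with empty strings."""
    parts = text.split(sep, n - 1)
    return parts + [""] * (n - len(parts))

# Hypothetical combined string; the trailing "!," yields an empty last row.
combined = "CountryA@/link-a@10!,CountryB@/link-b@20!,"

rows = [split_fixed(r, "@", 3) for r in combined.split("!,")]

# Keep only Country and Tax, and drop rows whose country is empty.
records = [(country, tax) for country, _link, tax in rows if country != ""]
print(records)
```

Padding every row to three fields is what lets the rows stack into a rectangular table, and it is also why the trailing empty rows have to be filtered out afterwards.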
R:
library(rvest)
library(stringr)
library(magrittr)
page <- read_html('http://taxsummaries.pwc.com/ID/Corporate-income-tax-(CIT)-rates')
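# Combine the text of all dspQCLinks* nodes, then split into rows on "!,"
info <- strsplit(paste(html_text(html_nodes(page, "[id^=dspQCLinks]")), collapse = ''), "!,")[[1]]
# Split each row into exactly three fields: country, link, tax
df <- data.frame(str_split_fixed(info, "@", 3))
colnames(df) <- c("Country","Link","Tax")
df <- subset(df, select = c("Country","Tax"))
# Drop the empty rows left at the bottom
df <- df[df$Country != "",]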
View(df)
Python:
I did this first in Python as it was quicker for me:
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd

r = requests.get('http://taxsummaries.pwc.com/ID/Corporate-income-tax-(CIT)-rates')
soup = bs(r.content, 'lxml')
# Concatenate the text of every node whose id starts with dspQCLinks
text = ''
for i in soup.select('[id^=dspQCLinks]'):
    text += i.text
rows = text.split('!,')
countries = []
tax_info = []
for row in rows:
    if row:  # skip empty rows
        items = row.split('@')
        countries.append(items[0])
        tax_info.append(items[2])
df = pd.DataFrame(list(zip(countries, tax_info)), columns=['Country', 'Tax'])
print(df)