val

Reputation: 1699

webscraping using rvest package comes out empty

There is a table of taxes by country at the link below that I would like to scrape into a dataframe with Country and Tax columns.

I've tried using the rvest package as follows to get my Country column but the list I generate is empty and I don't understand why.

I would appreciate any pointers on resolving this problem.

library(rvest)
d1 <- read_html(
  "http://taxsummaries.pwc.com/ID/Corporate-income-tax-(CIT)-rates"
  )
TaxCountry <- d1 %>%
  html_nodes('.countryNameQC') %>%
  html_text()

Upvotes: 0

Views: 222

Answers (1)

QHarr

Reputation: 84465

The data is loaded dynamically: the DOM is only altered once JavaScript runs in the browser. rvest does not execute JavaScript, so it only ever sees the HTML as the server sent it.

The following selectors, in the browser, would have isolated your nodes:

.twoCountryWrapper .countryNameAndYearQC:nth-child(1) .countryNameQC
.twoCountryWrapper .countryNameAndYearQC:nth-child(1) .countryYear 
.twoCountryWrapper .countryNameAndYearQC:nth-child(2) .countryNameQC
.twoCountryWrapper .countryNameAndYearQC:nth-child(2) .countryYear

But those classes are not even present in the HTML that rvest returns.
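One way to confirm this kind of problem is to list the class names that actually occur in the static HTML. The sketch below uses only the Python standard library; the HTML fragment is a made-up stand-in for the server response, not the real page, and ClassCollector and classes_in are hypothetical helper names.

```python
from html.parser import HTMLParser


class ClassCollector(HTMLParser):
    """Collect every CSS class name that appears in a static HTML string."""

    def __init__(self):
        super().__init__()
        self.classes = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "class" and value:
                self.classes.update(value.split())


def classes_in(html):
    """Return the set of class names found in the given HTML (hypothetical helper)."""
    collector = ClassCollector()
    collector.feed(html)
    return collector.classes


# Made-up stand-in for the raw response: the class never appears on an element
# because the script that would create those nodes has not run.
static_html = (
    '<div id="dspQCLinks1">Albania@/link-a@15!,</div>'
    '<script>/* would build the .countryNameQC nodes in a browser */</script>'
)

print("countryNameQC" in classes_in(static_html))
```

If the class you are selecting on is missing from that set, the selector cannot match, whatever scraping library you use.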

The data of interest is actually stored in several nodes, all of which have ids sharing the common prefix dspQCLinks. Inside each node the data is one long delimited string: every row ends with "!," and the fields within a row are separated by "@".

So, you can gather all those nodes using a CSS attribute selector with the starts-with operator (^=):

html_nodes(page, "[id^=dspQCLinks]")

Then extract the text and combine it into one string:

paste(html_text(html_nodes(page, "[id^=dspQCLinks]")), collapse = '')
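To make it concrete what the starts-with selector and the paste step do, here is a minimal standard-library sketch in Python. The HTML fragment, the link values and the second country are invented placeholders, PrefixIdText is a hypothetical helper, and the sketch assumes the matched nodes contain no unclosed void tags such as a bare br.

```python
from html.parser import HTMLParser


class PrefixIdText(HTMLParser):
    """Gather the text of elements whose id starts with a given prefix."""

    def __init__(self, prefix):
        super().__init__()
        self.prefix = prefix
        self.depth = 0      # greater than 0 while inside a matching element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            self.depth += 1  # nested tag inside a matching element
        elif (dict(attrs).get("id") or "").startswith(self.prefix):
            self.depth = 1   # entering a matching element

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


# Placeholder fragment: only the two dspQCLinks nodes should be picked up.
sample = (
    '<div id="dspQCLinks1">Albania@/link-a@15!,</div>'
    '<div id="other">ignored</div>'
    '<div id="dspQCLinks2">Examplia@/link-b@20!,</div>'
)

parser = PrefixIdText("dspQCLinks")
parser.feed(sample)
text = "".join(parser.chunks)  # same role as paste(..., collapse = '')
print(text)
```

With a real selector engine (rvest, BeautifulSoup) the class above collapses to the single selector [id^=dspQCLinks].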

Now each row in your table is delimited by "!,", so we can split on that to generate the rows:

info = strsplit(paste(html_text(html_nodes(page, "[id^=dspQCLinks]")), collapse = ''),"!,")[[1]]

An example row would then look like:

"Albania@/uk/taxsummaries/wwts.nsf/ID/Albania-Corporate-Taxes-on-corporate-income@15"

If we split each row on the @, the data we want is at indices 1 and 3:

arr <- strsplit(info[1], '@')[[1]]  # first row as an example
country <- arr[1]
tax <- arr[3]
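The two splits can be tried out without touching the network. In this sketch the first row is the example quoted above, and the second is a made-up placeholder so that there is more than one row to split.

```python
# First row: the example from the answer. Second row: an invented placeholder.
text = (
    "Albania@/uk/taxsummaries/wwts.nsf/ID/Albania-Corporate-Taxes-on-corporate-income@15!,"
    "Examplia@/hypothetical/link@20!,"
)

# Splitting on "!," leaves an empty piece after the final delimiter, so drop it.
rows = [row for row in text.split("!,") if row]

pairs = []
for row in rows:
    fields = row.split("@")
    # Python indexes from 0, so R's arr[1] and arr[3] become fields[0] and fields[2].
    pairs.append((fields[0], fields[2]))

print(pairs)
```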

Thanks to @Brian's feedback, I have removed the loop I used to build the dataframe and replaced it with str_split_fixed. To quote @Brian: str_split_fixed(info, "@", 3) "gives you a character matrix, which can be directly coerced to a dataframe".

df <- data.frame(str_split_fixed(info, "@", 3))

You then name the columns and remove the empty rows at the bottom of the df:

colnames(df) <- c("Country","Link","Tax")
df <- df[df$Country != "",]
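For readers more at home in Python, the fixed-width split and the empty-row filter can be imitated roughly as below. split_fixed is a hypothetical helper that mimics the padding behaviour of str_split_fixed, and the rows are placeholders apart from the Albania rate taken from the example above.

```python
def split_fixed(s, sep, n):
    """Split s into exactly n pieces, padding with empty strings (hypothetical helper)."""
    parts = s.split(sep, n - 1)
    return parts + [""] * (n - len(parts))


# Placeholder rows, including the empty strings that produce the blank rows.
info = ["Albania@/link-a@15", "", "Examplia@/link-b@20", ""]

# Every row becomes [Country, Link, Tax], even when the input string is empty.
matrix = [split_fixed(row, "@", 3) for row in info]

# Keep Country and Tax, dropping rows whose Country is empty.
table = [(country, tax) for country, _link, tax in matrix if country != ""]
print(table)
```

Padding every row to the same width is what lets the result be treated as a rectangular table before the empty rows are filtered out.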



R:

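library(rvest)
library(stringr)

page <- read_html('http://taxsummaries.pwc.com/ID/Corporate-income-tax-(CIT)-rates')
# Join the text of every node whose id starts with dspQCLinks, then split into rows on "!,"
info <- strsplit(paste(html_text(html_nodes(page, "[id^=dspQCLinks]")), collapse = ''), "!,")[[1]]
# Each row is Country@Link@Tax, so split it into a 3-column character matrix
df <- data.frame(str_split_fixed(info, "@", 3), stringsAsFactors = FALSE)
colnames(df) <- c("Country", "Link", "Tax")
df <- subset(df, select = c("Country", "Tax"))
df <- df[df$Country != "", ]  # drop the empty rows at the bottom
View(df)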

Python:

I did this first in Python, as it was quicker for me:

import requests
from bs4 import BeautifulSoup as bs
import pandas as pd

r = requests.get('http://taxsummaries.pwc.com/ID/Corporate-income-tax-(CIT)-rates')
soup = bs(r.content, 'lxml')

# Join the text of every node whose id starts with dspQCLinks
text = ''.join(node.text for node in soup.select('[id^=dspQCLinks]'))

countries = []
tax_info = []

# Rows are delimited by "!,"; each row is Country@Link@Tax
for row in text.split('!,'):
    items = row.split('@')
    if len(items) >= 3:  # skip empty or incomplete rows
        countries.append(items[0])
        tax_info.append(items[2])

df = pd.DataFrame({'Country': countries, 'Tax': tax_info})
print(df)

Reading:

  1. str_split_fixed

Upvotes: 1
