Reputation: 91
I am currently using RStudio for web scraping, but I have run into an issue.
page_html <- read_html("http://competitie.vttl.be/index.php?menu=6&sel=36665&result=1&category=1")
> page_html %>% html_nodes("td:nth-child(1) :nth-child(2) :nth-child(3).DBTable_first") %>% html_text()
[1] "A [+]"
> identical((page_html %>% html_nodes("td:nth-child(1) :nth-child(2) :nth-child(3) .DBTable_first") %>% html_text()),"A [+]")
[1] FALSE
> page_html %>% html_nodes("td:nth-child(1) :nth-child(2) :nth-child(4).DBTable_first") %>% html_text()
[1] "B0"
> identical((page_html %>% html_nodes("td:nth-child(1) :nth-child(2) :nth-child(4) .DBTable_first") %>% html_text()),"B0")
[1] TRUE
The comparison with "A [+]" always returns FALSE and I don't know why. Someone else who ran exactly the same code gets TRUE. Does anyone know how to solve this?
Upvotes: 1
Views: 62
Reputation: 1297
The webpage is UTF-8 encoded, and the scraped text seems to contain a non-ASCII character that prints like a space but is not one, which is why the comparison fails.
library(rvest)
page_html <- read_html("http://competitie.vttl.be/index.php?menu=6&sel=36665&result=1&category=1")
grade <- page_html %>% html_nodes("td:nth-child(1) :nth-child(2) :nth-child(3) .DBTable_first") %>% html_text()
grade
[1] "A [+]"
Encoding(grade)
[1] "UTF-8"
Encoding(grade) <- "unknown"
grade
[1] "AÂ [+]"
Notice the extra character! What prints as a space is not a plain ASCII space, so the string is not identical to "A [+]".
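To see exactly which character is hiding there, you can look at the code points instead of the printed string. This is only a sketch that runs without fetching the page: it assumes the odd character is a non-breaking space (U+00A0), which is what the "Â" in the output above suggests, and uses a hard-coded stand-in for the scraped value.

```r
# Stand-in for the scraped value. ASSUMPTION: the character between "A" and
# "[+]" is U+00A0 (non-breaking space), inferred from the output above.
grade <- "A\u00a0[+]"

# One integer code point per character. A regular space is 32;
# a non-breaking space is 160.
codes <- utf8ToInt(grade)
print(codes)

# The string prints like "A [+]" but is not identical to it.
same_as_plain <- identical(grade, "A [+]")
print(same_as_plain)
```

Running utf8ToInt() on the real scraped value would confirm or refute the assumption.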
One solution is to strip the non-ASCII characters:
grade <- page_html %>% html_nodes("td:nth-child(1) :nth-child(2) :nth-child(3) .DBTable_first") %>% html_text()
grade <- iconv(grade, from = "UTF-8", to = "ASCII", sub = "")
identical(grade,"A[+]")
[1] TRUE
NB: converting from UTF-8 to ASCII with an empty substitution string drops the non-ASCII space-like character, so the comparison is now against "A[+]".
BTW, I had to adjust the CSS selector string in html_nodes to get this to work.
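If you would rather keep the space and compare against "A [+]" as originally written, another option is to replace the offending character with a regular space instead of dropping it. A minimal sketch, again assuming the character is a non-breaking space (U+00A0) and using a hard-coded stand-in for the scraped value; normalize_spaces is just a hypothetical helper name:

```r
# Stand-in for the scraped value. ASSUMPTION: the odd character is U+00A0.
grade <- "A\u00a0[+]"

# Hypothetical helper: swap every non-breaking space for a plain ASCII space.
normalize_spaces <- function(x) {
  gsub("\u00a0", " ", x, fixed = TRUE)
}

cleaned <- normalize_spaces(grade)

# The comparison can now use the string exactly as it appears on screen.
matches_plain <- identical(cleaned, "A [+]")
print(matches_plain)
```

Unlike the iconv approach, this leaves any other legitimate non-ASCII text (accented names, for example) untouched.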
Upvotes: 2