M.D.

Reputation: 91

String comparison returns FALSE although the strings look identical

I am using R (in RStudio) to scrape a web page right now, but I have an issue.

page_html <- read_html("http://competitie.vttl.be/index.php?menu=6&sel=36665&result=1&category=1")

> page_html %>% html_nodes("td:nth-child(1) :nth-child(2) :nth-child(3).DBTable_first") %>% html_text()
[1] "A [+]"
> identical((page_html %>% html_nodes("td:nth-child(1) :nth-child(2) :nth-child(3) .DBTable_first") %>% html_text()),"A [+]")
[1] FALSE
> page_html %>% html_nodes("td:nth-child(1) :nth-child(2) :nth-child(4).DBTable_first") %>% html_text()
[1] "B0"
> identical((page_html %>% html_nodes("td:nth-child(1) :nth-child(2) :nth-child(4) .DBTable_first") %>% html_text()),"B0")
[1] TRUE

The comparison with "A [+]" always returns FALSE and I don't know why. Someone else gets TRUE using exactly the same method. Does anyone know how to solve this?

Upvotes: 1

Views: 62

Answers (1)

Jeremy Voisey

Reputation: 1297

The webpage is UTF-8 encoded and the scraped string contains a non-ASCII character, which seems to be causing the issue.

library(rvest)
page_html <- read_html("http://competitie.vttl.be/index.php?menu=6&sel=36665&result=1&category=1")
grade <- page_html %>% html_nodes("td:nth-child(1) :nth-child(2) :nth-child(3) .DBTable_first") %>% html_text()
grade
[1] "A [+]"
Encoding(grade)
[1] "UTF-8"
Encoding(grade) <- "unknown"
grade
[1] "AÂ [+]"

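Notice the extra character! The "space" between "A" and "[+]" is not a plain ASCII space, so the scraped string is not identical to "A [+]".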
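
To see exactly which character is involved, the code points can be inspected directly. This is a minimal sketch, assuming the scraped value is "A", a non-breaking space (U+00A0) and "[+]". The literal below is a stand-in for the scraped string so that the example runs without fetching the page.

```r
# Assumed stand-in for the scraped value: "A", U+00A0 (no-break space), "[+]"
grade <- enc2utf8("A\u00a0[+]")

# Code points of the scraped-like string; 160 is U+00A0
utf8ToInt(grade)
# 65 160 91 43 93

# Code points of the string typed in the comparison; 32 is an ordinary space
utf8ToInt("A [+]")
# 65 32 91 43 93

# UTF-8 bytes: the no-break space is stored as the two bytes c2 a0
charToRaw(grade)
```

If the value really contains U+00A0, its UTF-8 bytes c2 a0 read as Latin-1 would give "Â" followed by a no-break space, which would explain the output above.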

One solution is

grade <- page_html %>% html_nodes("td:nth-child(1) :nth-child(2) :nth-child(3) .DBTable_first") %>% html_text()
# sub = "" drops any character that cannot be represented in ASCII
grade <- iconv(grade, from = "UTF-8", to = "ASCII", sub = "")
identical(grade, "A[+]")
[1] TRUE

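NB: the empty string passed as the last argument of iconv() means that characters which cannot be converted to ASCII are dropped. This removes the space, so the comparison is now against "A[+]".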
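
An alternative that keeps the space is to normalise the string before comparing, or to compare against a literal containing the same character. This is a sketch under the assumption that the odd character is a non-breaking space (U+00A0). The `grade` literal is a stand-in for the scraped value, and `grade_clean` is a name introduced here for illustration.

```r
# Assumed stand-in for the scraped value: "A", U+00A0 (no-break space), "[+]"
grade <- "A\u00a0[+]"

# The no-break space only looks like an ordinary space
identical(grade, "A [+]")
# FALSE

# Option 1: compare against a literal that contains the same character
identical(grade, "A\u00a0[+]")
# TRUE

# Option 2: replace the no-break space with an ordinary space, then compare
grade_clean <- gsub("\u00a0", " ", grade, fixed = TRUE)
identical(grade_clean, "A [+]")
# TRUE
```

Option 2 may be more convenient when several scraped values are compared against strings typed on the keyboard, since it leaves the other comparisons unchanged.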

BTW, I had to adjust the CSS selector string in html_nodes() to get this to work.

Upvotes: 2
