Reputation: 3650
I'm doing some web scraping with rvest and I've come across something odd. One of the scraped strings looks like " " (a single space) but doesn't compare equal to one. I've reproduced this on two computers, a macOS system and a Windows 10 system, both running R 3.6.3.
library(rvest)
library(stringr)
# scrape website, no issue
webpage <- rvest::read_html("https://www.usms.org/longdist/ldnats00/1hrf4044.php")
html <- rvest::html_nodes(webpage, css = "td")
results <- rvest::html_text(html)
# cleaning results a bit, no issue
results <- stringr::str_replace(results, "\r\n", "")
results <- results[results != ""]
# the mystery string
results[605]
[1] " "
If I compare results[605] with " ", or with the copy-and-pasted result of printing results[605], I get FALSE:
results[605] == " "
[1] FALSE
If I store results[605] in a variable, it equals the original element but still not " ":
string_605 <- results[605]
string_605
[1] " "
results[605] == string_605
[1] TRUE
string_605 == " "
[1] FALSE
Just as a sanity check:
" " == " "
[1] TRUE
What is this mystery string, and how do I match it? I'd like to get rid of it with something like results <- results[results != mystery string]
Upvotes: 1
Views: 54
Reputation: 1688
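The string here is U+00A0, a non-breaking space (NO-BREAK SPACE). It prints like an ordinary space but is a different character, which is why comparing it with " " gives FALSE.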
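One way to confirm this without leaving R is to look at the code points directly. The snippet below is a minimal sketch: `mystery` is a hypothetical stand-in for results[605], built by hand so the example runs without scraping the page.

```r
# Hypothetical stand-in for results[605]; the real value comes from the scraped page
mystery <- "\u00a0"

# It prints like a space but does not compare equal to one
mystery == " "                 # FALSE

# Code points: 160 (0xA0) for the non-breaking space, 32 for an ordinary space
utf8ToInt(mystery)             # 160
utf8ToInt(" ")                 # 32

# Raw UTF-8 bytes of the non-breaking space
charToRaw(enc2utf8(mystery))   # c2 a0
```

Running `utf8ToInt()` on the real results[605] should show the same code point if the cell really holds a non-breaking space.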
My usual approach is to copy the value with clipr::write_clip(results[605]) and paste it somewhere else. That lets you see the character's code, and you can also paste it into Google to look it up :)
After that, you can remove it with results <- results[results != '\U00A0']
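If non-breaking spaces might also appear next to other whitespace, or inside cells that hold real data, an exact comparison will miss them. A possible alternative, sketched here with base R on a made-up vector (the values are assumptions for illustration, not the scraped data), is to turn every U+00A0 into an ordinary space, trim, and then drop the empty entries.

```r
# Made-up stand-in for the scraped vector; the real results come from html_text()
results <- c("Smith", "\u00a0", "45:30", " \u00a0 ", "")

# Replace each non-breaking space with an ordinary space, then trim the ends
cleaned <- trimws(gsub("\u00a0", " ", results, fixed = TRUE))

# Drop entries that are now empty
cleaned <- cleaned[cleaned != ""]
cleaned   # "Smith" "45:30"
```

Note that this removes more elements than the exact comparison would, so positions such as 605 will shift afterwards.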
Upvotes: 2