Greg

Reputation: 3650

String seeming to be one single space character, but isn't

I'm doing some web scraping using rvest and I've come across something odd. There's a string that looks like " " but isn't. I've reproduced this on two computers, a Mac OSX system running R 3.6.3 and a Windows 10 system running R 3.6.3.

library(rvest)
library(stringr)
# scrape website, no issue
webpage <- rvest::read_html("https://www.usms.org/longdist/ldnats00/1hrf4044.php")
html <- rvest::html_nodes(webpage, css = "td")
results <- rvest::html_text(html)
# cleaning results a bit, no issue
results <- stringr::str_replace(results, "\\\r\\\n", "")
results <- results[results != ""]
# the mystery string
results[605]
[1] " "

If I compare results[605] with " ", or with the copy-and-pasted result of printing results[605], I get FALSE:

results[605] == " "
[1] FALSE

If I store results[605] in a variable, the variable behaves the same way:

string_605 <- results[605]
string_605
[1] " "
results[605] == string_605
[1] TRUE
string_605 == " "
[1] FALSE

Just as a sanity check

" " == " "
[1] TRUE

What is this mystery string, and how do I match it? I'd like to get rid of it with something like results <- results[results != mystery string]

Upvotes: 1

Views: 54

Answers (1)

Frank Zhang

Reputation: 1688

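The string here is <U+00A0>, the Unicode non-breaking space. It prints like an ordinary space but is a different character, which is why the comparison with " " returns FALSE.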

My usual approach is to run clipr::write_clip(results[605]) and paste the result somewhere else. That lets you see the code of the character, and you can also paste it into Google to look it up :)

After that, you can remove it with results <- results[results != '\U00A0']
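
If you'd rather not go through the clipboard, base R can report the code point directly. The following is only a minimal sketch: mystery and the small results vector are hypothetical stand-ins for results[605] and the scraped data, and it assumes the element is a single U+00A0 character.

```r
# Hypothetical stand-in for results[605]: a single non-breaking space (U+00A0).
mystery <- "\u00A0"

# It prints like a space, but it is not equal to an ordinary space.
is_plain_space <- mystery == " "

# utf8ToInt() returns the Unicode code point of the character as an integer.
code_point <- utf8ToInt(mystery)

# Format the code point in the usual U+XXXX notation.
code_label <- sprintf("U+%04X", code_point)

# Hypothetical sample vector mixing real values with the mystery string
# and an ordinary space.
results <- c("Smith", mystery, "41:02", " ")

# Drop only the elements that are exactly one non-breaking space.
cleaned <- results[results != "\u00A0"]

print(is_plain_space)
print(code_label)
print(cleaned)
```

Note that the filter only removes elements that consist of exactly one non-breaking space; the ordinary " " element is kept.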

Upvotes: 2
