Reputation: 1112
I would like to convert HTML character entities like
&
to &
or
>
to >
For Perl exists the package HTML::Entities which could do that, but I couldn't find something similar in R.
I also tried iconv()
but couldn't get satisfying results. Maybe there is also a way using the XML
package but I haven't figured it out yet.
Upvotes: 26
Views: 14936
Reputation: 11110
library(xml2)
xml_text(read_html(charToRaw("& >")))
gives:
[1] "& >"
Upvotes: 3
Reputation: 98
Based on Stibu's answer, I went to benchmark the functions.
# first create large vector as in Stibu's answer
set.seed(123)
strings <- c("abcd", "& ' >", "&", "€ <")
many_strings <- sample(strings, 10000, replace = TRUE)
# then benchmark the functions by Stibu and Jeroen
bench::mark(
textutils::HTMLdecode(many_strings),
map_chr(many_strings, unescape_html),
unescape_html2(many_strings)
)
# A tibble: 3 x 13
expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory time
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list> <list> <lis>
1 textutils::HTMLdecode(many_strings) 855.02ms 855.02ms 1.17 329.18MB 10.5 1 9 855.02ms <chr … <Rpro… <bch…
2 map_chr(many_strings, unescape_html) 1.09s 1.09s 0.919 6.79MB 5.51 1 6 1.09s <chr … <Rpro… <bch…
3 unescape_html2(many_strings) 4.85ms 5.13ms 195. 581.48KB 0 98 0 503.63ms <chr … <Rpro… <bch…
# … with 1 more variable: gc <list>
Warning message:
Some expressions had a GC in every iteration; so filtering is disabled.
Here I vectorize Jeroen's unescape_html
function by purrr::map_chr
operator. So far, this just confirms Stibu's claim that the unescape_html2
is indeed many times faster! It is even way faster than textutils::HTMLdecode
function.
But I also found that the xml
version could be even faster.
unescape_xml2 <- function(str){
html <- paste0("<x>", paste0(str, collapse = "#_|"), "</x>")
parsed <- xml2::xml_text(xml2::read_xml(html))
strsplit(parsed, "#_|", fixed = TRUE)[[1]]
}
However, this function fails when dealing with the many_strings
object (maybe because read_xml
can not read Euro symbol. So I have to try a different way for benchmarking.
library(tidyverse)
library(rvest)
entity_html <- read_html("https://dev.w3.org/html5/html-author/charref")
entity_mapping <- entity_html %>%
html_node(css = "table") %>%
html_table() %>%
rename(text = X1,
named = X2,
hex = X3,
dec = X4,
desc = X5) %>%
as_tibble
s2 <- entity_mapping %>% pull(dec) # dec can be replaced by hex or named
bench::mark(
textutils::HTMLdecode(s2),
map_chr(s2, unescape_xml),
map_chr(s2, unescape_html),
unescape_xml2(s2),
unescape_html2(s2)
)
# A tibble: 5 x 13
expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory time gc
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list> <list> <list> <list>
1 textutils::HTMLdecode(s2) 191.7ms 194.9ms 5.16 64.1MB 10.3 3 6 582ms <chr … <Rprofm… <bch:… <tibb…
2 map_chr(s2, unescape_xml) 73.8ms 80.9ms 11.9 1006.9KB 5.12 7 3 586ms <chr … <Rprofm… <bch:… <tibb…
3 map_chr(s2, unescape_html) 162.4ms 163.7ms 5.83 1006.9KB 5.83 3 3 514ms <chr … <Rprofm… <bch:… <tibb…
4 unescape_xml2(s2) 459.2µs 473µs 2034. 37.9KB 2.00 1017 1 500ms <chr … <Rprofm… <bch:… <tibb…
5 unescape_html2(s2) 590µs 607.5µs 1591. 37.9KB 2.00 796 1 500ms <chr … <Rprofm… <bch:… <tibb…
Warning message:
Some expressions had a GC in every iteration; so filtering is disabled.
We can also try on hex
ones.
> bench::mark(
+ # gsubreplace_mapping(s2, entity_mapping),
+ # gsubreplace_local(s2),
+ textutils::HTMLdecode(s3),
+ map_chr(s3, unescape_xml),
+ map_chr(s3, unescape_html),
+ unescape_xml2(s3),
+ unescape_html2(s3)
+ )
# A tibble: 5 x 13
expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory time gc
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list> <list> <list> <list>
1 textutils::HTMLdecode(s3) 204.2ms 212.3ms 4.72 64.1MB 7.87 3 5 636ms <chr … <Rprofm… <bch:… <tibb…
2 map_chr(s3, unescape_xml) 76.4ms 80.2ms 11.8 1006.9KB 5.04 7 3 595ms <chr … <Rprofm… <bch:… <tibb…
3 map_chr(s3, unescape_html) 164.6ms 165.3ms 5.80 1006.9KB 5.80 3 3 518ms <chr … <Rprofm… <bch:… <tibb…
4 unescape_xml2(s3) 487.4µs 500.5µs 1929. 74.5KB 2.00 965 1 500ms <chr … <Rprofm… <bch:… <tibb…
5 unescape_html2(s3) 611.1µs 627.7µs 1574. 40.4KB 0 788 0 501ms <chr … <Rprofm… <bch:… <tibb…
Warning message:
Some expressions had a GC in every iteration; so filtering is disabled.
Here the xml
version is even more faster than the html
version.
Upvotes: 1
Reputation: 15897
While the solution by Jeroen does the job, it has the disadvantage that it is not vectorised and therefore slow if applied to a large number of characters. In addition, it only works with a character vector of length one and one has to use sapply
for a longer character vector.
To demonstrate this, I first create a large character vector:
set.seed(123)
strings <- c("abcd", "& ' >", "&", "€ <")
many_strings <- sample(strings, 10000, replace = TRUE)
And apply the function:
unescape_html <- function(str) {
xml2::xml_text(xml2::read_html(paste0("<x>", str, "</x>")))
}
system.time(res <- sapply(many_strings, unescape_html, USE.NAMES = FALSE))
## user system elapsed
## 2.327 0.000 2.326
head(res)
## [1] "& ' >" "€ <" "& ' >" "€ <" "€ <" "abcd"
It is much faster if all the strings in the character vector are combined into a single, large string, such that read_html()
and xml_text()
need only be used once. The strings can then easily be separated again using strsplit()
:
unescape_html2 <- function(str){
html <- paste0("<x>", paste0(str, collapse = "#_|"), "</x>")
parsed <- xml2::xml_text(xml2::read_html(html))
strsplit(parsed, "#_|", fixed = TRUE)[[1]]
}
system.time(res2 <- unescape_html2(many_strings))
## user system elapsed
## 0.011 0.000 0.010
identical(res, res2)
## [1] TRUE
Of course, you need to be careful that the string that you use to combine the various strings in str
("#_|"
in my example) does not appear anywhere in str
. Otherwise, you will introduce an error, when the large string is split again in the end.
Upvotes: 8
Reputation: 5358
Update: this answer is outdated. Please check the answer below based on the new xml2 pkg.
Try something along the lines of:
# load XML package
library(XML)
# Convenience function to convert html codes
html2txt <- function(str) {
xpathApply(htmlParse(str, asText=TRUE),
"//body//text()",
xmlValue)[[1]]
}
# html encoded string
( x <- paste("i", "s", "n", "&", "a", "p", "o", "s", ";", "t", sep = "") )
[1] "isn't"
# converted string
html2txt(x)
[1] "isn't"
UPDATE: Edited the html2txt() function so it applies to more situations
Upvotes: 21
Reputation: 32978
Unescape xml/html values using xml2
package:
unescape_xml <- function(str){
xml2::xml_text(xml2::read_xml(paste0("<x>", str, "</x>")))
}
unescape_html <- function(str){
xml2::xml_text(xml2::read_html(paste0("<x>", str, "</x>")))
}
Examples:
unescape_xml("3 < x & x > 9")
# [1] "3 < x & x > 9"
unescape_html("€ 2.99")
# [1] "€ 2.99"
Upvotes: 30