Reputation: 635
I need to create a function that extracts and changes part of a word. The goal is to convert a Unicode escape such as <U+00E9> into its percent-encoded UTF-8 form.
My input would be for instance
word = "Aul<U+00E9>n"
My output would be
f(word) = "Aul%c3%a9n"
I don't know how to select only the <U+00E9> part of the input word.
Does anyone have an idea how to do that? Thanks in advance!
Upvotes: 1
Views: 143
Reputation: 24074
This is too long for a comment, but what I meant in my last comment is:
you can build a correspondence data.frame like:
corresp <- data.frame(uni=c("<U+00E9>", "<U+00EC>"), utf=c("%c3%a9", "%c3%ac"), stringsAsFactors=FALSE)
Then you can define a recode function, e.g. like:
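recode <- function(word, corresp){
  # extract the first <U+XXXX> escape from the word
  code <- regmatches(word, regexpr("<U\\+[0-9A-Fa-f]+>", word))
  # look up its percent-encoded UTF-8 form
  m_code <- corresp$utf[corresp$uni == code]
  # fixed = TRUE: the "+" in the code must match literally, not act as a regex quantifier
  sub(code, m_code, word, fixed = TRUE)
}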
And so:
recode("Aul<U+00E9>n", corresp)
#[1] "Aul%c3%a9n"
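If maintaining a lookup table for every character is impractical, the escape can also be decoded directly. The following is only a minimal sketch in base R, not part of the answer above: the helper name encode_escapes is hypothetical, and it assumes every escape has the form <U+XXXX> with hexadecimal digits.

```r
# Hypothetical helper (not from the original answer): percent-encode every
# <U+XXXX> escape in a string as lowercase UTF-8 bytes.
encode_escapes <- function(word) {
  pattern <- "<U\\+[0-9A-Fa-f]+>"
  m <- gregexpr(pattern, word)
  regmatches(word, m) <- lapply(regmatches(word, m), function(codes) {
    vapply(codes, function(code) {
      # strip the "<U+" prefix and the ">" suffix to keep the hex digits
      hex <- substr(code, 4, nchar(code) - 1)
      # hex digits -> code point -> character -> UTF-8 bytes
      bytes <- charToRaw(enc2utf8(intToUtf8(strtoi(hex, base = 16L))))
      # as.character() on raw bytes yields lowercase two-digit hex
      paste0("%", as.character(bytes), collapse = "")
    }, character(1), USE.NAMES = FALSE)
  })
  word
}

encode_escapes("Aul<U+00E9>n")
#[1] "Aul%c3%a9n"
```

Because it uses gregexpr, this sketch replaces all escapes in the string rather than only the first one, and strings without any escape are returned unchanged.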
Upvotes: 3
Reputation: 837
Try this. Please run install.packages("Unicode") first.
word = "Aul<U+00E9>n"
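start <- regexpr("<.*?>", word)
# position just past the closing ">"
end <- start + attr(x = start, which = "match.length")
# look up the character for the hex code point between "<U+" and ">"
unipart <- Unicode::u_char_inspect(substr(word, start + 3, end - 2))[[3]]
# convert the character to raw bytes and prefix each byte with "%"
paste(substr(word, 1, start - 1), paste("%", paste(iconv(unipart, toRaw = TRUE)[[1]], collapse = "%"), sep = ""), substr(word, end, nchar(word)), sep = "")
#[1] "Aul%c3%a9n"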
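The offsets used with substr in this answer may be easier to follow when inspected one at a time. This is only an illustrative sketch in base R; the variable names len and hex are introduced here and are not part of the answer.

```r
word <- "Aul<U+00E9>n"

# regexpr() returns the 1-based position of the first match; its
# "match.length" attribute holds the number of characters matched
start <- regexpr("<.*?>", word)
len <- attr(start, "match.length")
end <- start + len                # first position after the closing ">"

as.integer(start)                 # 4, the position of "<"
len                               # 8, the length of "<U+00E9>"

# skip the three characters "<U+" and stop before ">"
hex <- substr(word, start + 3, end - 2)
hex                               # "00E9"

substr(word, 1, start - 1)        # "Aul", the text before the escape
substr(word, end, nchar(word))    # "n", the text after the escape
```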
Upvotes: 0