Jb_Eyd

Reputation: 635

Extracting a subset of characters from a word with R

I need to create a function that extracts and changes part of a word. The goal is to convert a Unicode escape such as <U+00E9> into its percent-encoded UTF-8 form.

My input would be for instance

word = "Aul<U+00E9>n"

My output would be

f(word) = "Aul%c3%a9n"

I don't know how to select only the <U+00E9> part of the word.

Does anyone have an idea how to do that? Thanks in advance!

Upvotes: 1

Views: 143

Answers (2)

Cath

Reputation: 24074

This is too long for a comment, but what I meant in my last comment is:

You can build a correspondence data.frame like:

corresp <- data.frame(uni=c("<U+00E9>", "<U+00EC>"), utf=c("%c3%a9", "%c3%ac"), stringsAsFactors=FALSE)

Then you can define a recode function, e.g. like:

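recode <- function(word, corresp){
              # pull out the <U+XXXX> escape, even when it ends the word
              code <- sub("[^<]*(<U[^>]+>).*", "\\1", word)
              # look up its percent-encoded UTF-8 form
              m_code <- corresp$utf[corresp$uni==code]
              # fixed=TRUE: "+" is a regex metacharacter, so match the code literally
              return(sub(code, m_code, word, fixed=TRUE))
          }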

And so:

recode("Aul<U+00E9>n", corresp)
#[1] "Aul%c3%a9n"
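
The lookup table has to list every escape by hand. As a rough sketch of an alternative, assuming every escape has the form <U+XXXX> with 4 to 6 hex digits, the bytes can be computed in base R instead. The name encode_escapes is made up for this illustration:

```r
# Hypothetical helper: percent-encode every <U+XXXX> escape in a character vector.
# Assumes escapes are written as <U+ followed by 4-6 hex digits and >.
encode_escapes <- function(x) {
  pattern <- "<U\\+([0-9A-Fa-f]{4,6})>"
  m <- gregexpr(pattern, x)
  regmatches(x, m) <- lapply(regmatches(x, m), function(codes) {
    vapply(codes, function(code) {
      hex <- sub(pattern, "\\1", code)                        # e.g. "00E9"
      bytes <- charToRaw(intToUtf8(strtoi(hex, base = 16L)))  # UTF-8 bytes of the code point
      paste0("%", as.character(bytes), collapse = "")         # e.g. "%c3%a9"
    }, character(1), USE.NAMES = FALSE)
  })
  x
}

encode_escapes("Aul<U+00E9>n")
```

This handles several escapes in one word and leaves words without escapes unchanged, but it does no validation of the code points themselves.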

Upvotes: 3

vck

Reputation: 837

Try this. Run install.packages("Unicode") first.

word = "Aul<U+00E9>n"
start<-regexpr("<.*?>",word)
end<-start+attr(x = start,which = "match.length")
unipart<-Unicode::u_char_inspect(substr(word,start+3,end-2))[3]
paste(substr(word,1,start-1),paste("%",paste(iconv(unipart,toRaw = T)[[1]],collapse="%"),sep=""),substr(word,end,nchar(word)),sep = "")

>[1] "Aul%c3%a9n"

Upvotes: 0
