agbarnett

Reputation: 180

Replacing non-ascii dash with hyphen in R

I am reading XML files with varied formats into R, using UTF-8 encoding. I'm having trouble replacing non-ASCII negative signs, which look like "−". I can't simply strip all non-ASCII characters because I want to keep the negative sign. The gsub call below does not work, and I've tried many different options for the pattern.

in_text = "<td align=\"left\" rowspan=\"1\" colspan=\"1\">−0.68 (1.04)</td>"
gsub(pattern='−', replacement='-', in_text)
<td align=\"left\" rowspan=\"1\" colspan=\"1\">−0.68 (1.04)</td>

I can see they are non-ASCII:

tools::showNonASCII(in_text)
<td align="left" rowspan="1" colspan="1"><e2><88><92>0.68 (1.04)</td>

Upvotes: 2

Views: 815

Answers (1)

MrFlick

Reputation: 206253

The text posted in the question doesn't appear to actually contain the non-ASCII character. I think your source matches this:

in_text = "<td align=\"left\" rowspan=\"1\" colspan=\"1\">\u22120.68 (1.04)</td>"
in_text
# [1] "<td align=\"left\" rowspan=\"1\" colspan=\"1\">−0.68 (1.04)</td>"

The character "\u2212" (U+2212, MINUS SIGN) matches the output you get from tools::showNonASCII: the bytes e2 88 92 are its UTF-8 encoding. So if you use that escaped character as the pattern, the replacement should work fine:

gsub(pattern='\u2212', replacement='-', in_text)
# [1] "<td align=\"left\" rowspan=\"1\" colspan=\"1\">-0.68 (1.04)</td>"
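
If the XML files also contain other dash look-alikes, the same idea can be extended with a character class. The sketch below is only an illustration: `normalize_dashes` is a hypothetical helper name, and the set of code points it replaces (hyphen U+2010, en dash U+2013, em dash U+2014, minus sign U+2212) is an assumption about what might appear in the source files. It first checks that the bytes reported by tools::showNonASCII really belong to U+2212.

```r
# Same input as above, written with the escaped minus sign
in_text <- "<td align=\"left\" rowspan=\"1\" colspan=\"1\">\u22120.68 (1.04)</td>"

# Check that U+2212 encodes to the bytes e2 88 92 seen in showNonASCII()
charToRaw("\u2212")
# [1] e2 88 92
utf8ToInt("\u2212") == 0x2212
# [1] TRUE

# Hypothetical helper: map an assumed set of dash look-alikes to an ASCII hyphen
# (U+2010 hyphen, U+2013 en dash, U+2014 em dash, U+2212 minus sign)
normalize_dashes <- function(x) {
  gsub("[\u2010\u2013\u2014\u2212]", "-", x)
}

normalize_dashes(in_text)
# [1] "<td align=\"left\" rowspan=\"1\" colspan=\"1\">-0.68 (1.04)</td>"
```

Adjust the character class to whatever tools::showNonASCII actually reports for your files, so that legitimate non-ASCII text is left untouched.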

Upvotes: 2
