Reputation: 25983
In Ruby, how do I decode C-style escape sequences? E.g., '\n' to a newline, '\t' to a tab?
Upvotes: 7
Views: 2071
Reputation: 132979
The following code decodes the escape sequences defined by the ISO C standard. It is safe and reasonably performant:
ISO_C_ESCAPE_SEQUENCES = %r{
# One letter escapes
(?:\\[abfnrtv\\'"?])
# Hex encoded character
| (?:\\(x)([A-Fa-f0-9]{2,}))
# Any Unicode code point (8 hex digits) or
# Unicode code point below 0x10000 (4 hex digits)
| (?:\\(u)((?:[A-Fa-f0-9]{8})|(?:[A-Fa-f0-9]{4})))
# Octal encoded character
| (?:\\([0-7]{1,3}))
}x
ISO_C_ONE_LETTER_ESCAPES = {
"\\a" => "\a",
"\\b" => "\b",
"\\f" => "\f",
"\\n" => "\n",
"\\r" => "\r",
"\\t" => "\t",
"\\v" => "\v",
"\\\\" => "\\",
"\\'" => "'",
"\\\"" => "\"",
"\\?" => "?"
}
def decodeCString( cString )
return cString.gsub(ISO_C_ESCAPE_SEQUENCES) { |match|
replacement = ISO_C_ONE_LETTER_ESCAPES[match]
next replacement if replacement
next $2.to_i(16).chr if $1 == "x"
next $4.to_i(16).chr(Encoding::UTF_8) if $3 == "u"
next $5.to_i(8).chr
}
end
Here's a sample:
puts decodeCString("Line \\\\n Same Line!\\nNew line\\x0ANew line")
puts decodeCString("Smiley: \\u263A\tHorse head: \\u00010083")
puts decodeCString("Equal sign in quotes: \\\"\\75\\\"")
prints
Line \n Same Line!
New line
New line
Smiley: ☺ Horse head: 𐂃
Equal sign in quotes: "="
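The numbered globals ($1 through $5) in the block above depend on counting capture groups across the whole alternation. As a rough sketch of the same idea with named captures, something like the following might be easier to follow. The names C_ESCAPE, SIMPLE_ESCAPES and decode_c_escapes are hypothetical, and two choices here are assumptions of this sketch rather than part of the answer: \x is capped at two hex digits, and \x and octal values are treated as Unicode code points.

```ruby
# Hypothetical names; a sketch of the regex approach using named captures.
C_ESCAPE = /
  \\ (?:
      (?<simple> [abfnrtv\\'"?] )   # one-letter escapes
    | x (?<hex> \h{1,2} )           # hex value, capped at two digits (assumption)
    | u (?<uni> \h{8} | \h{4} )     # code point with 8 or 4 hex digits
    | (?<oct> [0-7]{1,3} )          # octal value
  )
/x

SIMPLE_ESCAPES = {
  'a' => "\a", 'b' => "\b", 'f' => "\f", 'n' => "\n", 'r' => "\r",
  't' => "\t", 'v' => "\v", '\\' => "\\", "'" => "'", '"' => '"', '?' => '?'
}.freeze

def decode_c_escapes(str)
  str.gsub(C_ESCAPE) do
    m = Regexp.last_match
    if m[:simple]
      SIMPLE_ESCAPES[m[:simple]]
    elsif m[:hex]
      # Assumption: the value is interpreted as a code point, not a raw byte.
      m[:hex].to_i(16).chr(Encoding::UTF_8)
    elsif m[:uni]
      m[:uni].to_i(16).chr(Encoding::UTF_8)
    else
      m[:oct].to_i(8).chr(Encoding::UTF_8)
    end
  end
end

puts decode_c_escapes('Equal sign in quotes: \"\75\"')
```

Single-quoted Ruby literals are used for the input so that the backslashes reach the decoder unchanged.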
Upvotes: 1
Reputation: 369458
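EDIT: Note that this does not actually work in general: an escaped backslash followed by n (two backslashes and an n in the input) is wrongly turned into a backslash plus a newline. You really need to build a proper parser here with a state machine that keeps track of whether you are in an escape sequence or not.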
Ruby supports many of the same escape sequences, so you could build a simple translation table like this:
T = {
'\n' => "\n",
'\t' => "\t",
'\r' => "\r"
}
And then use that translation table to replace those sequences in the source string:
a = '1\t2\n3'
a.gsub(/#{T.keys.map(&Regexp.method(:escape)).join('|')}/, &T.method(:[]))
# => "1\t2\n3"
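To make the failure mode of a table like this concrete, here is a small sketch. The names NAIVE_TABLE, WITH_BACKSLASH and translate are hypothetical. It also shows one possible way to handle the escaped backslash within the same gsub approach: add it to the table, so that the left-to-right, non-overlapping scan consumes both backslashes together. This is an illustration under those assumptions, not a full C-escape decoder.

```ruby
# Hypothetical names; the first table mirrors the three-entry table above.
NAIVE_TABLE = { '\n' => "\n", '\t' => "\t", '\r' => "\r" }.freeze
# Same table plus an entry mapping two backslashes to one.
WITH_BACKSLASH = { '\\\\' => '\\' }.merge(NAIVE_TABLE).freeze

def translate(str, table)
  # Regexp.union escapes each string key and joins them with "|".
  str.gsub(Regexp.union(table.keys)) { |match| table[match] }
end

# Input characters: a, backslash, backslash, n
input = 'a' + ('\\' * 2) + 'n'

p translate(input, NAIVE_TABLE)     # => "a\\\n"  (backslash, then a real newline)
p translate(input, WITH_BACKSLASH)  # => "a\\n"   (backslash, then the letter n)
```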
Upvotes: 0
Reputation: 17104
Okay, if you don't like the eval solution, I've hacked together a simple state machine in Ruby that parses simple "\n" and "\t" in strings correctly, including pre-escaping of the backslash itself. Here it is:
BACKSLASH = "\\"
def unescape_c_string(s)
state = 0
res = ''
s.each_char { |c|
case state
when 0
case c
when BACKSLASH then state = 1
else res << c
end
when 1
case c
when 'n' then res << "\n"; state = 0
when 't' then res << "\t"; state = 0
when BACKSLASH then res << BACKSLASH; state = 0
else res << BACKSLASH; res << c; state = 0
end
end
}
res << BACKSLASH if state == 1 # keep a trailing lone backslash instead of dropping it
return res
end
This one can be easily extended to support more characters, including multi-character entities like \123.
Unit tests to prove that it works:
require 'test/unit'
class TestEscapeCString < Test::Unit::TestCase
def test_1
assert_equal("abc\nasd", unescape_c_string('abc\nasd'))
end
def test_2
assert_equal("abc\tasd", unescape_c_string('abc\tasd'))
end
def test_3
assert_equal("abc\\asd", unescape_c_string('abc' + BACKSLASH * 2 + 'asd'))
end
def test_4
assert_equal("abc\\nasd", unescape_c_string('abc' + BACKSLASH * 2 + 'nasd'))
end
def test_5
assert_equal("abc\\\nasd", unescape_c_string('abc' + BACKSLASH * 3 + 'nasd'))
end
def test_6
assert_equal("abc\\\\nasd", unescape_c_string('abc' + BACKSLASH * 4 + 'nasd'))
end
end
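As an illustration of the extension to multi-character entities such as \123 mentioned above, here is a minimal sketch that adds a third state for collecting up to three octal digits. The method name unescape_c_octal and the symbolic state names are hypothetical, and interpreting the octal value as a Unicode code point is an assumption of this sketch.

```ruby
# Hypothetical sketch: the same state-machine idea plus an :octal state.
def unescape_c_octal(s)
  res = +''
  digits = +''
  state = :text
  s.each_char do |c|
    if state == :octal
      if digits.length < 3 && c.match?(/[0-7]/)
        digits << c
        next
      end
      # Octal run is over: emit it, then handle c as ordinary input below.
      # Assumption: the value is treated as a Unicode code point.
      res << digits.to_i(8).chr(Encoding::UTF_8)
      digits = +''
      state = :text
    end
    case state
    when :text
      if c == "\\"
        state = :escape
      else
        res << c
      end
    when :escape
      state = :text
      case c
      when 'n' then res << "\n"
      when 't' then res << "\t"
      when "\\" then res << "\\"
      when /[0-7]/
        digits << c
        state = :octal
      else res << "\\" << c
      end
    end
  end
  # Flush whatever was pending when the input ended.
  res << digits.to_i(8).chr(Encoding::UTF_8) if state == :octal
  res << "\\" if state == :escape
  res
end

p unescape_c_octal('A\101\tB') # => "AA\tB"
```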
Upvotes: 12
Reputation: 17104
Shorter, even more hacky and fairly dangerous, due to eval:
eval "\"#{string}\""
A simple example:
> a = '1\t2\n3'
> puts a
1\t2\n3
> puts eval "\"#{a}\""
1 2
3
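To show what "fairly dangerous" means in practice, here is a small sketch. The wrapper name eval_unescape is hypothetical and exists only for this demonstration. Because the input is pasted into a double-quoted literal, any interpolation it contains is executed as Ruby code, and an embedded double quote breaks the literal.

```ruby
# Hypothetical wrapper around the eval one-liner, for demonstration only.
def eval_unescape(string)
  eval "\"#{string}\""
end

p eval_unescape('1\t2')      # => "1\t2"

# Interpolation inside the input is executed as Ruby code:
p eval_unescape('#{6 * 7}')  # => "42"

# An embedded double quote ends the literal early and breaks parsing:
begin
  eval_unescape('a"b')
rescue SyntaxError
  puts "SyntaxError: input could not be evaluated"
end
```

The harmless arithmetic here stands in for arbitrary code, which is why this approach should only ever see trusted input.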
Upvotes: 3