Huy

Reputation: 11206

Removing backslash (escape character) from a string

I am trying to work on my own JSON parser. I have an input string that I want to tokenize:

input = "{ \"foo\": \"bar\", \"num\": 3}"

How do I remove the escape character \ so that it is not a part of my tokens?

Currently, my solution using delete works:

tokens = input.delete('\\"').split("")

=> ["{", " ", "f", "o", "o", ":", " ", "b", "a", "r", ",", " ", "n", "u", "m", ":", " ", "3", "}"]

However, when I try to use gsub, it fails to find any \".

tokens = input.gsub('\\"', '').split("")

=> ["{", " ", "\"", "f", "o", "o", "\"", ":", " ", "\"", "b", "a", "r", "\"", ",", " ", "\"", "n", "u", "m", "\"", ":", " ", "3", "}"]

I have two questions:

1. Why does gsub not work in this case?

2. How do I remove the backslash (escape) character? I currently have to remove the backslash character with the quotes to make this work.

Upvotes: 27

Views: 47030

Answers (5)

Pavel Kalashnikov

Reputation: 2091

As for why this string appeared in the first place:

Just in case, check whether your code calls to_json twice on a Hash (or on some other object).

require 'json'
puts({ "foo": "bar", "num": 3 }.to_json)         # prints {"foo":"bar","num":3}
puts({ "foo": "bar", "num": 3 }.to_json.to_json) # prints "{\"foo\":\"bar\",\"num\":3}"

Upvotes: 0

Dan

Reputation: 1288

input.gsub(/[\"]/,"") will also work.

Upvotes: 6

Amadan

Reputation: 198324

You do not have backslashes in your string. You have quotes in your string, which need to be escaped when placed in a double-quoted string. Look:

input = "{ \"foo\": \"bar\", \"num\": 3}"
puts input
# => { "foo": "bar", "num": 3}

You are removing phantoms.

input.delete('\\"')

will delete any characters in its argument. Thus, you delete any non-existent backslashes, and also delete all quotes. Without quotes, the default display method (inspect) will not need to escape anything.

input.gsub('\\"', '')

will try to delete the sequence \", which does not exist, so gsub ends up doing nothing.

Make sure you know the difference between a string's representation (puts input.inspect) and its content (puts input); the backslashes are artifacts of the representation.
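A quick way to check this yourself, as a minimal sketch using only core String methods (the variable names has_backslash and quote_count are just for illustration):

```ruby
input = "{ \"foo\": \"bar\", \"num\": 3}"

# Content: the characters the string actually holds.
puts input

# Representation: how Ruby would write the string as a literal.
puts input.inspect

# The content has no backslash at all, only six quote characters.
has_backslash = input.include?("\\")
quote_count   = input.count('"')
```

If has_backslash is false, there is nothing for a backslash-matching pattern to find.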

That said, I have to echo emaillenin: writing a correct JSON parser is not simple, and you can't do it with regular expressions (or at least, not with regular regular expressions; it might be possible with Oniguruma). It needs a proper parser like treetop or rex/racc, since it has a lot of corner cases that are easy to miss (chief among them being, ironically, escaped characters).
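To illustrate the escaped-character corner case, here is a small sketch with a made-up input in which the backslashes really are part of the content. It uses JSON.parse from Ruby's standard library only as a reference for the correct result; that is my choice for the comparison, not something the question requires.

```ruby
require 'json'

# Single-quoted Ruby literal, so these backslashes ARE in the string content.
raw = '{ "quote": "say \"hi\"" }'

# Two real backslash characters this time.
backslashes = raw.count('\\')

# A real parser keeps the inner quotes as part of the value.
parsed = JSON.parse(raw)

# Naively stripping every quote loses the string boundaries
# and leaves the backslashes behind.
naive = raw.delete('"')
```

Here parsed["quote"] is the five-plus-three character value say "hi", while naive is no longer valid JSON at all.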

Upvotes: 13

Arie Xiao

Reputation: 14082

When you write:

input = "{ \"foo\": \"bar\", \"num\": 3}"

The actual string stored in input is:

{ "foo": "bar", "num": 3}

The escape \" here is interpreted by the Ruby parser, so that it can distinguish between the boundaries of the string (the leftmost and rightmost ") and a normal " character inside the string (the escaped ones).

String#delete deletes the set of characters specified by its first parameter, rather than a pattern. Every character selected by that parameter is removed individually. So by writing

input.delete('\\"')

You asked for characters to be removed one by one, not for the two-character sequence \" to be removed. As a result, every " in input is gone, whether or not a \ preceded it. This is wrong for your case. It may cause unexpected behavior some time later.
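The difference between a character set and a sequence is easier to see on a string without any escaping involved. A minimal sketch (the sample string and variable names are made up for illustration):

```ruby
sample = "hello world"

# delete: removes every "l" and every "o", wherever they occur.
by_set = sample.delete("lo")

# gsub with a string: removes only the exact sequence "lo".
by_sequence = sample.gsub("lo", "")

input = "{ \"foo\": \"bar\", \"num\": 3}"

# input contains no backslash, so both calls end up removing just the quotes.
same_result = input.delete('\\"') == input.delete('"')
```

With this sample, by_set is "he wrd" and by_sequence is "hel world".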

String#gsub, however, substitutes a pattern (either a regular expression or a plain string).

input.gsub('\\"', '')

means find every \" (two characters in sequence) and replace it with an empty string. Since there is no \ in input, nothing gets replaced. What you need is actually:

input.gsub('"', '')
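Putting the two gsub calls side by side, as a small sketch (unchanged, stripped and tokens are illustrative names):

```ruby
input = "{ \"foo\": \"bar\", \"num\": 3}"

# Pattern is backslash followed by a quote: it never occurs, so nothing changes.
unchanged = input.gsub('\\"', '')

# Pattern is a single quote character: all six quotes are removed.
stripped = input.gsub('"', '')

# One-character tokens, the same result as split("").
tokens = stripped.chars
```

Here unchanged is equal to input, and tokens has 19 elements, matching the output of the delete version in the question.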

Upvotes: 39

Raj

Reputation: 22926

Use regex pattern:

> input = "{ \"foo\": \"bar\", \"num\": 3}"
> input.gsub(/"/,'').split("")

> => ["{", " ", "f", "o", "o", ":", " ", "b", "a", "r", ",", " ", "n", "u", "m", ":", " ", "3", "}"]

The string actually contains only a double quote. The backslash in the literal is just there to escape it.

Upvotes: 2
