Tesserex

Reputation: 17314

Challenge - escape this text, safely yet accurately

This is a follow-up to my last question here. The answer posted there does not actually work, so here is the challenge. You are given this code (assume jQuery is included):

<input type=text>
<script>
    $("input").val(**YOUR PHP / JS CODE HERE**);
</script>

Using jQuery - and not by injecting PHP output directly into the input tag - faithfully reproduce ANY text from the database in the input tag. If the database field says </script>, the field should say that too. If it has Chinese in it, double quotes, whatever, reproduce that too. Assume your PHP variable is called $text.

Here are some of my failed attempts.

1)

$("input").val("<?= htmlentities($text); ?>");

FAILURE: The HTML entities are not decoded; they show up literally in the text field.
INPUT: $text = "Déjà vu"
OUTPUT: Field contains literal D&eacute;j&agrave; vu

2)

$("input").val(<?= json_encode($text); ?>);

This was suggested as the answer in my last question, and I naively accepted it. However...
FAILURE: json_encode only works with UTF-8 encoded strings.
INPUT: $text = "Va e de här fö frågor egentlien"
OUTPUT: Field is blank, because json_encode returns null.

3)

var temp = $("<div></div>").html("<?= htmlentities($text); ?>");
$("input").val(temp.html());

This was my most promising solution for the weird characters, except...
FAILURE: Does not encode some characters (not sure exactly which, don't care)
INPUT: $text = "</script> Déjà"
OUTPUT: Field contains &lt;/script&gt; Déjà

4) Suggested in answers

$("input").val(unescape("<?= urlencode($text); ?>"));

FAILURE: Spaces remain encoded as +'s.

$("input").val(unescape("<?= rawurlencode($text); ?>"));

Almost works. All previous input succeeds, but multibyte stuff, like kanji, remains encoded. decodeURIComponent also doesn't like multibyte characters.

Note that for me, things like strip_tags are not an option. Everything must be allowed. People are authoring quizzes with this, and if someone wants to make a quiz that tests your knowledge of HTML, so be it. Also, unfortunately I cannot just inject the htmlentities escaped text into the value field of the input tags. These tags are generated dynamically, and I would have to totally tear down my current javascript code structure to do it that way.

I feel like I'm SOL here. Please show me how wrong I am.

EDIT

Assume the user initially entered </script> Déjà här fö frågor 漢字 into the db. This would be stored (you would see it in phpMyAdmin) as </script> Déjà här fö frågor &#28450;&#23383;

Upvotes: 2

Views: 318

Answers (6)

Scott Jungwirth

Reputation: 6675

Safe JavaScript escaping for ASCII strings:

<?php
function js_encode($string)
{
    $cleaned = is_null($string) ? null : '';

    // for each letter of the string
    for ($i=0, $len = strlen($string); $i < $len; $i++)
    {
        // get ascii number
        $ord = ord($string[$i]);
        // if [0-9] or [A-Z] or [a-z]
        $cleaned .= (47 < $ord && $ord < 58 OR 64 < $ord && $ord < 91 OR 96 < $ord && $ord < 123)
            // use existing character
            ? $string[$i]
            // otherwise escape it as \xHH (JavaScript needs exactly two hex digits)
            : sprintf('\\x%02x', $ord);
    }

    return $cleaned;
}

For Unicode text it is a little more complicated. I am going to start with this and see whether I need the more complex version.
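As a rough illustration of the same "escape everything that is not alphanumeric" idea, here is a sketch in JavaScript that works on UTF-16 code units and emits \uHHHH escapes. This is not the "more complex version" mentioned above, just one possible shape for it; the name escapeForJsLiteral and the sample string are made up for the example.

```javascript
// Hypothetical helper: escape every code unit that is not [0-9A-Za-z]
// as \uHHHH, so the result is safe inside a double-quoted JS string literal.
function escapeForJsLiteral(text) {
  let out = "";
  for (let i = 0; i < text.length; i++) {
    const ch = text[i];
    if (/[0-9A-Za-z]/.test(ch)) {
      // keep plain alphanumerics as they are
      out += ch;
    } else {
      // \u needs exactly four hex digits, so zero-pad
      out += "\\u" + text.charCodeAt(i).toString(16).padStart(4, "0");
    }
  }
  return out;
}

const sample = '</script> Déjà "漢字"';
const escaped = escapeForJsLiteral(sample);

// Evaluate the escaped text as a string literal to check the round trip.
const roundTripped = new Function('return "' + escaped + '";')();

console.log(escaped);
console.log(roundTripped === sample);
```

Because the output contains only letters, digits, and backslashes, it cannot close the string literal or the surrounding script block.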

Upvotes: 0

Tesserex

Reputation: 17314

I have found a "good enough" solution that you all might find interesting.

1. utf8_encode the string on the way into the database. This makes sure that it can be safely handled on the way out by the following steps.

2. Escape the string on the way out with these functions:

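function repl($match)
{
    // JavaScript's \u escape needs exactly four hex digits, so zero-pad.
    // Code points above 0xFFFF would need a surrogate pair, which this does not handle.
    return sprintf('\\u%04x', $match[1]);
}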

function esc($string)
{
    $s = json_encode($string);
    $s = preg_replace_callback("/&#([0-9]+);/", "repl", $s);
    return $s;
}

This isn't absolutely perfect, because there doesn't seem to be any way for the php to know the difference between the user typing 漢 or literally typing &#28450;. So if you type the latter it will become the former. But I doubt anyone will ever want to do that anyway.
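For comparison, the same substitution could be done on the JavaScript side instead of in PHP. This is only a sketch of that variant, not what the code above does; decodeNumericRefs is a made-up name. It has the same ambiguity: a literally typed &#28450; is converted as well.

```javascript
// Hypothetical helper: turn decimal numeric character references
// such as &#28450; back into the characters they stand for.
function decodeNumericRefs(text) {
  return text.replace(/&#([0-9]+);/g, (match, digits) => {
    const codePoint = Number(digits);
    // Leave out-of-range values untouched rather than throwing a RangeError.
    return codePoint <= 0x10ffff ? String.fromCodePoint(codePoint) : match;
  });
}

console.log(decodeNumericRefs("</script> Déjà &#28450;&#23383;"));
```

String.fromCodePoint also covers code points above 0xFFFF, which a plain four-digit \u escape cannot express.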

Upvotes: 1

Walter Mundt

Reputation: 25271

What encoding is your text in, if not UTF-8? If you don't know, you don't have text, you have a byte sequence, which is much harder to faithfully represent. If you do know, you can do something like this using the PHP multibyte string extension:

$("input").val(<?= json_encode(mb_convert_encoding($text, "UTF-8", "ISO-8859-1")); ?>);

Here I've presumed your input is in ISO-8859-1 aka Latin-1 encoding, which is a pretty common case for database strings.

EDIT: This is in response to the comments about a closing script tag. I made this test file and it displays properly for me, at least in Firefox 3.6:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
<head>
    <title>Test</title>
    <script src='http://code.jquery.com/jquery-1.4.2.js'></script>
</head>
<form name='foo'>
    <input name='bar' id='bar'/>
</form>
<script language="JavaScript">
    $('input').val("<\/script>");
</script>
</html>
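The test above works because PHP's json_encode escapes / as \/ by default, so the literal sequence </script> never appears in the output. A stricter variant is to escape the angle brackets themselves, which is what json_encode's JSON_HEX_TAG flag does in PHP. Here is a sketch of that idea in JavaScript; toInlineScriptLiteral is a made-up name, and the extra handling of U+2028/U+2029 is my own addition for older JavaScript engines.

```javascript
// Hypothetical helper: produce a JS string literal that is safe to print
// inside an inline <script> block.
function toInlineScriptLiteral(value) {
  return JSON.stringify(value)
    // hide < and > so "</script>" and "<!--" can never appear
    .replace(/</g, "\\u003c")
    .replace(/>/g, "\\u003e")
    // line separators were not allowed in string literals before ES2019
    .replace(/\u2028/g, "\\u2028")
    .replace(/\u2029/g, "\\u2029");
}

const original = '</script> Déjà "quoted" 漢字';
const literal = toInlineScriptLiteral(original);

console.log(literal);
console.log(JSON.parse(literal) === original);
```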

Upvotes: 1

Dolph

Reputation: 50700

You need to encode in PHP, and decode in JavaScript...

PHP's rawurlencode():

echo rawurlencode("</script> Déjà");
//result: %3C%2Fscript%3E%20D%C3%A9j%C3%A0 (for a UTF-8 string; the space becomes %20, not +)

JavaScript's decodeURIComponent():

var encoded = "%3C%2Fscript%3E%20D%C3%A9j%C3%A0";
alert(decodeURIComponent(encoded));
//result: </script> Déjà
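One caveat: decodeURIComponent expects the percent-encoded bytes to be valid UTF-8 and throws a URIError otherwise. That may be why it seemed not to like some input in the question, though that is a guess on my part. A small sketch, with decodeFromPhp as a made-up wrapper name:

```javascript
// Hypothetical wrapper: decode a rawurlencode()d string, returning null
// when the bytes are not valid UTF-8 instead of throwing.
function decodeFromPhp(encoded) {
  try {
    return decodeURIComponent(encoded);
  } catch (err) {
    return null;
  }
}

// UTF-8 bytes decode fine, including multibyte characters.
console.log(decodeFromPhp("%3C%2Fscript%3E%20D%C3%A9j%C3%A0")); // </script> Déjà
console.log(decodeFromPhp("%E6%BC%A2%E5%AD%97"));               // 漢字

// Latin-1 bytes (é as %E9) are not valid UTF-8, so decoding fails.
console.log(decodeFromPhp("D%E9j%E0"));                         // null
```

If the third case is what you are seeing, converting $text to UTF-8 in PHP before calling rawurlencode() should help.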

Upvotes: 1

Artefacto

Reputation: 97835

You can use:

Upvotes: 0

user268396

Reputation: 11996

You may want to use urlencode() and urldecode().

Upvotes: 0
