Omar Elewa

Reputation: 372

How to escape special characters?

I am using an HTML purifier package to strip any XSS from my rich text before storing it in the database.

But my rich text allows Wiris symbols, which use special characters such as → (&#8594;) or the non-breaking space (&#160;).

The problem is that the package does not allow me to escape these characters. It removes them completely. What should I do to escape them?

Example of the string before purifying

<p><math xmlns="http://www.w3.org/1998/Math/MathML"><msup><mi>x</mi><mn>2</mn></msup><mo>&#160;</mo><mo>+</mo><mo>&#160;</mo><mmultiscripts><mi>y</mi><mprescripts/><none/><mn>2</mn></mmultiscripts><mo>&#160;</mo><mover><mo>&#8594;</mo><mo>=</mo></mover><mo>&#160;</mo><msup><mi>z</mi><mn>2</mn></msup><mo>&#160;</mo></math></p>

After purifying

<p><math xmlns="http://www.w3.org/1998/Math/MathML"><msup><mi>x</mi><mn>2</mn></msup><mo> </mo><mo>+</mo><mo> </mo><mmultiscripts><mi>y</mi><mprescripts></mprescripts><none><mn>2</mn></mmultiscripts><mo> </mo><mover><mo>→</mo><mo>=</mo></mover><mo> </mo><msup><mi>z</mi><mn>2</mn></msup><mo> </mo></math></p>

Upvotes: 1

Views: 570

Answers (2)

Omar Elewa

Reputation: 372

I solved the problem by setting the key Core.EscapeNonASCIICharacters to true under the default key in my purifier.php file.
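For reference, the change might look roughly like the fragment below. This is only a sketch: it assumes the config layout used by the mews/purifier Laravel package (a settings array containing a default profile), which the question does not name explicitly, so adjust the nesting to whatever your purifier.php actually contains. As far as I know, HTML Purifier documents Core.EscapeNonASCIICharacters as converting all non-ASCII characters into decimal numeric entities.

```php
<?php
// config/purifier.php
// Assumed layout (mews/purifier style); only the last directive is the fix.
return [
    'settings' => [
        'default' => [
            // ... your existing directives stay here ...

            // Emit non-ASCII characters (such as → or a non-breaking space)
            // as numeric entities instead of raw characters.
            'Core.EscapeNonASCIICharacters' => true,
        ],
    ],
];
```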

Upvotes: 1

pinkgothic

Reputation: 6179

My guess is that these entities are failing the regexen that HTML Purifier is using to check for valid entities in HTMLPurifier_EntityParser, here:

         $this->_textEntitiesRegex =
             '/&(?:'.
             // hex
             '[#]x([a-fA-F0-9]+);?|'.
             // dec
             '[#]0*(\d+);?|'.
             // string (mandatory semicolon)
             // NB: order matters: match semicolon preferentially
             '([A-Za-z_:][A-Za-z0-9.\-_:]*);|'.
             // string (optional semicolon)
             "($semi_optional)".
             ')/';
 
         $this->_attrEntitiesRegex =
             '/&(?:'.
             // hex
             '[#]x([a-fA-F0-9]+);?|'.
             // dec
             '[#]0*(\d+);?|'.
             // string (mandatory semicolon)
             // NB: order matters: match semicolon preferentially
             '([A-Za-z_:][A-Za-z0-9.\-_:]*);|'.
             // string (optional semicolon)
             // don't match if trailing is equals or alphanumeric (URL
             // like)
             "($semi_optional)(?![=;A-Za-z0-9])".
             ')/';

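Notice the 0* in the // dec part: it skips any leading zeros before capturing the digits. Note that 0* also matches zero occurrences, so entities such as &#160; should still match this branch; treat the regex as a suspect rather than a confirmed cause. (The parser's behaviour is perfectly sane, since it is designed to handle pure HTML without add-ons and to make that safe, but in your use case you want more entity flexibility.)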
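A quick way to check what that decimal branch accepts is to run it in isolation. The sketch below copies only the // dec alternative from the regex above into a standalone pattern; the constant DEC_ENTITY_REGEX and the function decimalEntityCodepoint are hypothetical names for this test and are not part of HTML Purifier.

```php
<?php
// Standalone copy of the "// dec" alternative from the regex quoted above.
// This is NOT HTML Purifier's full entity regex, only that one branch.
const DEC_ENTITY_REGEX = '/&[#]0*(\d+);?/';

/**
 * Hypothetical helper: returns the captured decimal code point as a string,
 * or null when the input does not match the decimal branch.
 */
function decimalEntityCodepoint(string $text): ?string
{
    if (preg_match(DEC_ENTITY_REGEX, $text, $m) === 1) {
        return $m[1];
    }
    return null;
}

// Two entities from the question, one with a leading zero, one named entity.
foreach (['&#160;', '&#8594;', '&#0160;', '&rarr;'] as $entity) {
    var_dump(decimalEntityCodepoint($entity));
}
```

In this isolated pattern, &#160; and &#8594; match without any leading zero, so it may be worth confirming that the regex is really at fault before patching it.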

You could extend that class and override the constructor, which is where these regexen are defined, supplying your own versions with the 0* removed from the // dec part. Then instantiate your subclass and try setting $this->_entity_parser on a Lexer created with HTMLPurifier_Lexer::create($config) to that EntityParser object. This is the part I am least sure would work; you might have to patch the Lexer with an extends as well. Finally, supply the altered Lexer to the config using Core.LexerImpl.

I have no working proof-of-concept of these steps for you right now (especially in the context of Laravel), but you should be able to go through those motions in the purifier.php file, before the return.

Upvotes: 1
