Reputation: 372
I am using an HTML Purifier package to strip any XSS from my rich text before storing it in the database.
But my rich text allows Wiris symbols, which use special characters such as → or the non-breaking space.
The problem is that the package does not let me escape these characters; it removes them completely. What should I do to escape them?
Example of the string before purifying:
<p><math xmlns="http://www.w3.org/1998/Math/MathML"><msup><mi>x</mi><mn>2</mn></msup><mo> </mo><mo>+</mo><mo> </mo><mmultiscripts><mi>y</mi><mprescripts/><none/><mn>2</mn></mmultiscripts><mo> </mo><mover><mo>→</mo><mo>=</mo></mover><mo> </mo><msup><mi>z</mi><mn>2</mn></msup><mo> </mo></math></p>
After purifying:
<p><math xmlns="http://www.w3.org/1998/Math/MathML"><msup><mi>x</mi><mn>2</mn></msup><mo> </mo><mo>+</mo><mo> </mo><mmultiscripts><mi>y</mi><mprescripts></mprescripts><none><mn>2</mn></mmultiscripts><mo> </mo><mover><mo>→</mo><mo>=</mo></mover><mo> </mo><msup><mi>z</mi><mn>2</mn></msup><mo> </mo></math></p>
Upvotes: 1
Views: 570
Reputation: 372
I solved the problem by setting the key Core.EscapeNonASCIICharacters to true under the default key in my purifier.php file, and the problem has gone away.
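For reference, the setting described above might look roughly like this. This is only a sketch: the surrounding 'settings' array and the file location are assumptions based on the layout commonly used by Laravel HTML Purifier wrappers, so adjust the nesting to whatever your purifier.php already contains.

```php
<?php
// config/purifier.php (path and 'settings' nesting are assumptions, not from the post)
return [
    'settings' => [
        'default' => [
            // HTML Purifier directive: convert non-ASCII characters to
            // decimal numeric entities in the purified output.
            'Core.EscapeNonASCIICharacters' => true,
        ],
    ],
];
```

With this enabled, a character such as → should come back as a numeric entity rather than as a raw non-ASCII character.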
Upvotes: 1
Reputation: 6179
My guess is that these entities are failing the regexen that HTML Purifier uses to check for valid entities in HTMLPurifier_EntityParser, here:
$this->_textEntitiesRegex =
'/&(?:'.
// hex
'[#]x([a-fA-F0-9]+);?|'.
// dec
'[#]0*(\d+);?|'.
// string (mandatory semicolon)
// NB: order matters: match semicolon preferentially
'([A-Za-z_:][A-Za-z0-9.\-_:]*);|'.
// string (optional semicolon)
"($semi_optional)".
')/';
$this->_attrEntitiesRegex =
'/&(?:'.
// hex
'[#]x([a-fA-F0-9]+);?|'.
// dec
'[#]0*(\d+);?|'.
// string (mandatory semicolon)
// NB: order matters: match semicolon preferentially
'([A-Za-z_:][A-Za-z0-9.\-_:]*);|'.
// string (optional semicolon)
// don't match if trailing is equals or alphanumeric (URL
// like)
"($semi_optional)(?![=;A-Za-z0-9])".
')/';
Notice the 0* in the decimal branch: it skips any leading zeros before the digits are captured. (This is perfectly sane, since HTML Purifier is designed to handle pure HTML, without add-ons, and to make that safe; but in your use case, you want more entity flexibility.)
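One way to test this guess is to run the decimal branch of the regex on its own. The following is a minimal sketch in plain PHP that does not need HTML Purifier installed; decimalEntityDigits() is a hypothetical helper introduced only for this check, and the pattern is the "dec" alternative copied from the code quoted above.

```php
<?php
// Hypothetical helper (not part of HTML Purifier): applies only the
// decimal-entity branch of the regex quoted above and returns the
// captured digits, or null when the text contains no decimal entity.
function decimalEntityDigits(string $text): ?string
{
    // 0* consumes zero or more leading zeros; (\d+) captures the rest.
    if (preg_match('/&[#]0*(\d+);?/', $text, $m) === 1) {
        return $m[1];
    }
    return null;
}

var_dump(decimalEntityDigits('&#8594;'));   // entity without a leading zero
var_dump(decimalEntityDigits('&#08594;'));  // same entity with a leading zero
var_dump(decimalEntityDigits('→'));         // raw character, not an entity
```

Both entity forms yield the digits 8594, while the raw character yields null because it is not written as an entity at all. So it is worth checking whether your input actually reaches the purifier as entities or as raw non-ASCII characters before changing the regex.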
You could extend that class and override the constructor, which is where these regexen are defined, and define your own versions with the 0* removed from the "// dec" part. Then instantiate your class and try setting $this->_entity_parser on a Lexer created with HTMLPurifier_Lexer::create($config) to your EntityParser object. This is the part I am least sure would work; you might have to patch the Lexer with a subclass (extends) as well. Finally, supply the altered Lexer to the config using Core.LexerImpl.
I have no working proof-of-concept of these steps for you right now (especially in the context of Laravel), but you should be able to go through those motions in the purifier.php file, before the return.
Upvotes: 1