Sourabh

Reputation: 8482

Mongo RegEx - Match all types of space characters

The \s regex character class doesn't match all types of space characters in MongoDB (v4.0.3):

> db.test.insertOne({ "mail" : "special [email protected]" })
> db.test.insertOne({ "mail" : "normal [email protected]" })

> db.test.find({ mail: / / }, { _id: 0, mail: 1 })
{ "mail" : "special [email protected]" }
> db.test.find({ mail: /\s/ }, { _id: 0, mail: 1 })
{ "mail" : "normal [email protected]" }

The space in special [email protected] above is a special space character, while the one in normal [email protected] is a normal space.

Is this expected, or a bug? Is there any way to make it match all spaces?

Sidenote: I am running the regex inside $not, so I can't use $regex.


Edit: Even [^\S] doesn't match both strings:

> db.test.find({ mail: /[^\S]/ }, { _id: 0, mail: 1 })
{ "mail" : "normal [email protected]" }

Does MongoDB's regex only work with ASCII?

Upvotes: 2

Views: 1751

Answers (1)

Alex Blex

Reputation: 37038

MongoDB uses the PCRE flavour of regular expressions: https://docs.mongodb.com/manual/reference/operator/query/regex/#op._S_regex

https://www.pcre.org/original/doc/html/pcrepattern.html reads:

The default \s characters are now HT (9), LF (10), VT (11), FF (12), CR (13), and space (32), which are defined as white space in the "C" locale. This list may vary if locale-specific matching is taking place. For example, in some locales the "non-breaking space" character (\xA0) is recognized as white space, and in others the VT character is not.
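To make the quoted default concrete, here is a minimal sketch in plain JavaScript (runnable in Node; it does not talk to MongoDB). The helper name isPcreDefaultSpace is hypothetical and only encodes the six code points listed in the quote. U+00A0 is used as an assumed example of a "special" space, since the exact character in the question is not stated.

```javascript
// Code points PCRE treats as \s by default, per the quoted documentation:
// HT (9), LF (10), VT (11), FF (12), CR (13), space (32).
const PCRE_DEFAULT_SPACES = new Set([9, 10, 11, 12, 13, 32]);

// Hypothetical helper: is the first character of `ch` in that default set?
function isPcreDefaultSpace(ch) {
  return PCRE_DEFAULT_SPACES.has(ch.codePointAt(0));
}

const normalSpace = " ";       // U+0020
const noBreakSpace = "\u00a0"; // U+00A0, assumed example of a "special" space

console.log(isPcreDefaultSpace(normalSpace));  // true
console.log(isPcreDefaultSpace(noBreakSpace)); // false

// ECMAScript's own \s is broader and does include U+00A0.
console.log(/\s/.test(noBreakSpace)); // true
```

This is why the same pattern can behave differently when it is evaluated by a JavaScript engine and when it is evaluated by the server's PCRE engine.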

You can replace \s with

[\s\x00a0\x1680\x2000\x2001\x2002\x2003\x2004\x2005\x2006
\x2007\x2008\x2009\x200a\x2028\x2029\x202f\x205f\x3000\xfeff]

(split for readability) for compatibility with ECMA regex flavour.

A bare \x in PCRE reads at most two hex digits, so you will most likely need to wrap the codes in {}, e.g. \x{00a0}\x{1680} and so on.

For your query it would be:

db.test.find({ mail: /[\s\x{00a0}\x{1680}\x{2000}\x{2001}\x{2002}\x{2003}\x{2004}\x{2005}\x{2006}\x{2007}\x{2008}\x{2009}\x{200a}\x{2028}\x{2029}\x{202f}\x{205f}\x{3000}\x{feff}]/ }, { _id: 0, mail: 1 })
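Typing that class by hand is error-prone, so here is a small sketch that builds the same pattern source from a list of code points. It is plain JavaScript (runnable in Node); the names EXTRA_SPACE_CODE_POINTS, toPcreEscape and buildSpaceClass are hypothetical helpers for illustration, not part of any MongoDB API.

```javascript
// Extra whitespace code points taken from the character class above.
const EXTRA_SPACE_CODE_POINTS = [
  0x00a0, 0x1680, 0x2000, 0x2001, 0x2002, 0x2003, 0x2004, 0x2005, 0x2006,
  0x2007, 0x2008, 0x2009, 0x200a, 0x2028, 0x2029, 0x202f, 0x205f, 0x3000,
  0xfeff,
];

// Hypothetical helper: format one code point as a PCRE \x{hhhh} escape.
function toPcreEscape(codePoint) {
  return "\\x{" + codePoint.toString(16).padStart(4, "0") + "}";
}

// Hypothetical helper: build the character class source, starting with \s
// and appending one escape per extra code point.
function buildSpaceClass(codePoints) {
  return "[\\s" + codePoints.map(toPcreEscape).join("") + "]";
}

const spaceClass = buildSpaceClass(EXTRA_SPACE_CODE_POINTS);
console.log(spaceClass);
```

The printed string is the source of the pattern used in the query above. How it is handed to the server depends on the shell or driver in use.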

Upvotes: 4
