Reputation: 914

Regex for Gmail-like search

I've been trying to figure out a regex for Gmail-like search, i.e.:

name:Joe surname:(Foo Bar)

...like in this topic. But with a slight difference: if there is a text without a key:, it's also split, so:

foo:(hello world) bar:(-{bad things}) some text to search

would return:

foo:(hello world)
bar:(-{bad things})
some text to search

Upvotes: 5

Answers (7)

ohaal

Reputation: 5268

There's no way to grab everything you need using a single regex. The problem is that there is not a reliable way of grabbing the non-keyworded text.

However, if we first grab and store all the keyworded text, then do a regex replace (using the same regular expression) with an empty string, we suddenly get the search string by itself!

Grab the keywords and related text using the following regex (see it on RegExr):
```
([a-zA-Z]+:(?:\([^)]+?\)|[^( ]+))
```
Then do a regex replace, with the same regex, on the full search string using an empty string. The resulting string will be the non-keyworded search text. Something along the lines of:
```
Regex.Replace(searchtext, @"[a-zA-Z]+:(?:\([^)]+?\)|[^( ]+)", "");
```
Perform whitespace trimming at start and end of the search text
Remove double (or more spaces) from search text (can be done with regex replace, replacing with a single space):
```
Regex.Replace(searchtext, @" {2,}", " ");
                            ^-- notice the space :)
```
????
PROFIT!!!

It is entirely possible to perform whitespace removal in the regex in #2, but when dealing with regexes, I tend to prefer keeping it as clean as possible.

Upvotes: 4

Dave F

Reputation: 370

You don't need to solve this problem using only one regular expression. You can re-use the answer that you linked to that you indicated would partially work.

The last array element is the only one that needs to be corrected.

Using your example you'd initially get:

[
    "foo:(hello world)",
    "bar:(-{bad things}) some text to search"
]

The last item needs to be split into text up to and including the first closing bracket and text following it. You'd then replace the last item with the text up to and including the bracket and then you'd append the text following it to the array.

[
    "foo:(hello world)",
    "bar:(-{bad things})",
    "some text to search"
]

The following pseudo code should explain how this can be done:

array; // Array returned when string was split using /\s+(?=\w+:)/
lastPosition = array.length-1;

lastElem = array[lastPosition]; // May contain text without a key

// Key is followed by an opening bracket
//  (check for opening bracket after semi-colon following key)
if ( lastElem.match( /^[^:]*:(/ ) ) {
    // Need to replace array entry with key and all text up to and including
    // closing bracket.
    // Additional text needs to be added to array.

    maxSplitsAllowed = 1;
    results = lastElem.split( /)\w*/ , maxSplitsAllowed );
    // White space following the bracket was included in the match so it
    //  wouldn't be at the front of the text without a key

    lastKeyAndText = results[0] + ')'; // Re-append closing bracket
    endingTextWithoutKey = results[1];

    array[lastPosition] = lastKeyAndText; // Correct array entry for last key
    array.append( endingTextWithoutKey ); // Append text without key

// Key is not followed by a closing bracket but has text without a key
//  (check for white space following characters that aren't white space
//   characters)
} else if (lastElem.match( /^[^:]*:[^\w]*\w/ )) {
    // Need to change array entry so that all text before first space
    // becomes the key.
    // Additional text needs to be added to array.

    maxSplitsAllowed = 1;
    results = lastElem.split( /\w+/ , maxSplitsAllowed );

    lastKeyAndText = results[0];
    endingTextWithoutKey = results[1];

    array[lastPosition] = lastKeyAndText; // Correct array entry for last key
    array.append( endingTextWithoutKey ); // Append text without key
}

I assumed that brackets are required when white space characters are to be included within text that follows a key.

Upvotes: 0

Chris S

Reputation: 65426

The problem you face with going the Regex route is you run into issues with spaces. There is probably a really complicated Regex to do this, but for a simple regex you'll find your searches can't contain spaces for keywords, e.g.:

Works: site:mysite user:john
Fails: site:"my awesome site" user:john

This will fail because it is tokenizing based on spaces. So if space support is a requirement read on...

I would recommend either using the Lucene .NET engine's inbuilt parser to give you the tokens, or using a grammar and a parser such as GoldParser, Irony or Antlr.

It might sound too long and complicated for what you want, but having written a grammar for GoldParser to do exactly what you're doing, it is actually quite easy once the grammar is done. Here's an example of the grammar:

"Name"     = 'Spruce Search Grammar'
"Version"  = '1.1'
"About"    = 'The search grammar for Spruce TFS MVC frontend'

"Start Symbol" = <Query>

! -------------------------------------------------
! Character Sets
! -------------------------------------------------
{Valid} = {All Valid} - ['-'] - ['OR'] - {Whitespace} - [':'] - ["] - ['']
{Quoted} = {All Valid} - ["] - ['']

! -------------------------------------------------
! Terminals
! -------------------------------------------------
AnyChar    = {Valid}+
Or = 'OR'
Negate = ['-']
StringLiteral   = '' {Quoted}+ '' | '"' {Quoted}+ '"'

! -- Field-specific terms
Project     = 'project' ':'
...
CreatedOn   = 'created-on' ':'
ResolvedOn  = 'resolved-on' ':'
! -------------------------------------------------
! Rules
! -------------------------------------------------

! The grammar starts below
<Query> ::= <Query> <Keywords> | <Keywords>
<SingleWord> ::= AnyChar

<Keywords> ::= <SingleWord>
              | <QuotedString> 
              | <Or> 
              | <Negate> 
              | <FieldTerms>

<Or> ::= <Or> <SingleWord> 
        | Or Negate
        | Or <SingleWord>
        | Or <QuotedString>

<Negate> ::= <Negate> Negate <SingleWord>
            | <Negate> Negate <QuotedString>
            | Negate <SingleWord>
            | Negate <QuotedString>

<QuotedString> ::= StringLiteral

<FieldTerms> ::= <FieldTerms> Project | <FieldTerms> Description | <FieldTerms> State 
                | <FieldTerms> Type | <FieldTerms> Area | <FieldTerms> Iteration 
                | <FieldTerms> AssignedTo | <FieldTerms> ResolvedBy 
                | <FieldTerms> ResolvedOn | <FieldTerms> CreatedOn
                | Project 
                | <Description>
                | State 
                | Type 
                | Area 
                | Iteration 
                | CreatedBy
                | AssignedTo 
                | ResolvedBy
                | CreatedOn
                | ResolvedOn

<Description> ::= <Description> Description | <Description> Description StringLiteral
                | Description | Description StringLiteral

This gives you search support for something like:

resolved-by:john project:"amazing tfs project"

If you look at the Keywords token, you can see it is expecting a singleword, an OR, a quoted string, or a negative (a NOT). The hard part comes when this definition becomes recursive, which you see in the <Description> part.

The syntax is called EBNF which describes the format of your language. You can write something as simple as a search query parser in it, or an entire computer language. The way Goldparser parses the tokens will restrict you, as it looks ahead for tokens (LALR), so languages such as HTML and Wiki syntax will break any grammar you attempt to write, as these formats don't force you to close tags/tokens. Antlr gives you LL(*) which is more forgiving of missing start tags/tokens but isn't something you don't need to worry about for a search query parser.

The code folder for my grammar and C# code can be found in this project.

QueryParser is the class that parses the search string, the grammar file is the .grm file, the 2mb file is how Goldparser optimises your grammar to basically create its own table of possibilities. Calitha is the C# library for GoldParser, and is easy enough to implement. Without writing an even larger answer it's hard to describe exactly how it's done, but it's fairly straightforward once you have compiled the grammar, and Goldparser has a very intuitive IDE for writing grammars with and a huge set of existing ones such as SQL, C#, Java and even a Perl regex I believe.

It's not a 1 hour quick fix as you'd get from a regex though, closer to 2-3 days, however you do learn the 'proper' way of parsing.

Upvotes: 4

Kobi

Reputation: 137997

A simple approach here is to match the string with this pattern:

\w+:(?:\([^)]*\)|\S+)|\S+

That would match:

\w+: - a key.
(?:) - followed by...
- \([^)]*\) - parentheses
- | - or
- \S+ - some characters that are not space.
|\S+ - or just match a single word.

Note that this pattern breaks words into different matches. If you really can't handle that, you may use something like |(?:\S+(\s+(?!\w*:)[^\s:]+)*) instead of the last |\S+.

Working example: http://ideone.com/bExFd

Upvotes: 0

Kobi

Reputation: 137997

Another option, a little more robust:
Here we can use a somewhat advanced feature of .Net patterns - they keep all captures of all groups. This is a useful feature to build a full parser. Here, I've included some other search features like quoted string and operators (OR or range .., for example):

\A
(?>
    \s                      # skip over spaces.
    |
    (?<Key>\w+):            # Key:
    (?:                     # followed by:
        \(                     
        (?<KeyValue>[^)]*)      # Parentheses
        \)
        |                       # or
        (?<KeyValue>\S+)        # a single word
    )
    |
    (?<Operator>OR|AND|-|\+|\.\.)
    |
    ""(?<Term>[^""]*)""     # quoted term
    |
    (?<Term>\w+)            # just a word
    |
    (?<Invalid>.)           # Any other character isn't valid
)*
\z

You can now easily get all tokens and their positions (you can also zip the Key and KeyValue captures to pair them):

Regex queryParser = new Regex(pattern, RegexOptions.IgnorePatternWhitespace);
Match m = queryParser.Match(query); // single match!
// ...
var terms = m.Groups["Term"].Captures;

Working example: http://ideone.com/B7tln

Upvotes: 0

ilomambo

Reputation: 8350

This might work for you

In Java:

p = Pattern.compile("(\\w+:(\\(.*?\\))|.+)\\s*");
m = p.matcher("foo:(hello world) bar:(-{bad things}) some text to search");
while(m.find()){
    Log.v("REGEX", m.group(1));
}

Produces:

05-25 15:21:06.242: V/REGEX(18203): foo:(hello world)
05-25 15:21:08.061: V/REGEX(18203): bar:(-{bad things})
05-25 15:21:09.761: V/REGEX(18203): some text to search

The regex works as long as the tags are first and the free text is last.
Even for tags you can get the content with m.group(2)

Upvotes: 0

Neil_M

Reputation: 482

You may like to take a look at this question.

It contains the following Regex sample:

^((?!hede).)*$

As the author of the answers states "The regex above will match any string, or line without a line break, not containing the (sub) string 'hede'. "

Therefore you should be able to combine this with the information from the topic you posted and the above piece of Regex to solve your problem.

Hope this helps!!!

Upvotes: 0

Regex for Gmail-like search

Answers (7)

Related Questions