Aishwarya Shiva

Reputation: 3416

Regex to detect a pattern outside double quotes

I have a string like

FIND files where file2=29 AND file32="12" OR file623134="file23"

This text is entered by the user to search his/her data. The application converts it into a SQL query.

For example, FIND is replaced by SELECT, and strings matching the pattern file[number] (for example file2, file32, and file623134 in the string above) are converted to FILE_ID=[number] AND FILE_VALUE=[value of file[number]]. The resulting SQL query will be:

SELECT * FROM [FILES] WHERE (FILE_ID=2 AND FILE_VALUE=29) AND (FILE_ID=32 AND FILE_VALUE="12") OR (FILE_ID=623134 AND FILE_VALUE="file23")

What I have achieved so far, with the help of other SO questions, is to detect strings outside the double quotes using the following regex:

(?<![\S"])([^"\s]+)(?![\S"])

It's working fine, but due to my lack of knowledge about regular expressions, I am unable to find where in this regex I can place the file[0-9] pattern. How can I achieve this?

If possible, please also tell me how to extract the values from these patterns and replace them with the corresponding values, e.g. file123=2 with (FILE_ID=123 AND FILE_VALUE=2).

Upvotes: 1

Views: 1051

Answers (4)

Nigel Thorne

Reputation: 21558

Let's say we are matching FIND files where file2=29 AND file32="12" OR file623134="file23"

By way of explanation I'll do this in steps.

Obviously a regex that exactly matches the string would match.

FIND files where file2=29 AND file32="12" OR file623134="file23"

First, let's decide which bits we want to read from it, and make them accessible.

FIND (files) where file(2)=(29) AND file(32)=("12") OR file(623134)=("file23")

Here we stick brackets around all the bits that we want to read out. This defines those bits as "capture groups". In C# we can give them names. We will do that later.

Now, let's generalize this regex so it matches more examples. The keys are numbers, so we can capture them with [0-9]+. This means: match a character in the range 0 to 9, at least once.

FIND (files) where file([0-9]+)=(29) AND file([0-9]+)=("12") OR file([0-9]+)=("file23")

OK, now the values. Some of them are strings, so let's match those.

A string is a run of characters that are not a ", surrounded by two "s, that is "[^"]+". (Note: the plus means we can't match empty strings, as we need at least one character. A * would let you match empty strings.)

FIND (files) where file([0-9]+)=(29) AND file([0-9]+)=("[^"]+") OR file([0-9]+)=("[^"]+")

One of the values in this example is a number, so let's assume values can be integers.

FIND (files) where file([0-9]+)=([0-9]+) AND file([0-9]+)=("[^"]+") OR file([0-9]+)=("[^"]+")

Nothing makes the first example special, so let's assume all values could be strings or integers. To allow two options we use the | alternation operator. (Now, I guess you are yelling at the screen "No, they can be anything, not just strings and numbers", but that's OK. I'll deal with that later too.)

FIND (files) where file([0-9]+)=("[^"]+"|[0-9]+) AND file([0-9]+)=("[^"]+"|[0-9]+) OR file([0-9]+)=("[^"]+"|[0-9]+)

Now, we have a fair bit of duplication here. The last two parts are the same except that one has "OR" and the other has "AND". This is significant: we want to know which operator is being used, so let's capture that too.

FIND (files) where file([0-9]+)=("[^"]+"|[0-9]+) (AND) file([0-9]+)=("[^"]+"|[0-9]+) (OR) file([0-9]+)=("[^"]+"|[0-9]+)

Now we can factor out the duplication by removing the last part and saying it's a repeat of the previous key/value pair.

FIND (files) where file([0-9]+)=("[^"]+"|[0-9]+)( (AND|OR) file([0-9]+)=("[^"]+"|[0-9]+))*

I've added a "*" as that last part of the expression could be repeated as many times as needed, or not be there at all.
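As a rough check of the repeated group, the pattern above can be tried against queries with one or several conditions. This is only a sketch: the class name RepeatDemo and the ^ and $ anchors are my own additions (assumptions, not part of the answer), used so that a trailing fragment makes the whole match fail.

```csharp
using System.Text.RegularExpressions;

public static class RepeatDemo
{
    // The pattern from the step above, written as a C# verbatim string
    // (a literal " is written as ""). The ^ and $ anchors are an assumption
    // added here so that partial matches are rejected.
    private static readonly Regex Query = new Regex(
        @"^FIND (files) where file([0-9]+)=(""[^""]+""|[0-9]+)( (AND|OR) file([0-9]+)=(""[^""]+""|[0-9]+))*$");

    // True when the whole input is one condition followed by
    // zero or more "AND/OR condition" repetitions.
    public static bool IsValid(string input)
    {
        return Query.IsMatch(input);
    }
}
```

With this, RepeatDemo.IsValid("FIND files where file2=29") is true because the starred group may repeat zero times, while an input ending in a dangling " AND" is rejected.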

Now... If we want to handle the value being anything, float, time, etc. we either need to include matches for each, or a general "anything" matcher. Both have downsides. If we match all types explicitly, we have more work to do. If we don't then we need to make some assumptions about "how do we know when the value is finished?"

Say we assume there will be white space after the value. Then we can match all characters until we hit whitespace... [^\s]+

FIND (files) where file([0-9]+)=([^\s]+)( (AND|OR) file([0-9]+)=([^\s]+))*

But now.. if the value is a string, and it contains whitespace it breaks. We probably want to handle strings separately to fix this.

FIND (files) where file([0-9]+)=("[^"]+"|[^\s]+)( (AND|OR) file([0-9]+)=("[^"]+"|[^\s]+))*
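To see why the quoted-string alternative has to come first, here is a small comparison of the two value matchers on a value that contains a space. It is a sketch only; WhitespaceValueDemo and its method names are hypothetical, and the patterns are just the key/value part of the expressions above.

```csharp
using System.Text.RegularExpressions;

public static class WhitespaceValueDemo
{
    // Value is "everything up to the next whitespace".
    public static string NaiveValue(string input)
    {
        return Regex.Match(input, @"file([0-9]+)=([^\s]+)").Groups[2].Value;
    }

    // Value is a quoted string if one starts here, otherwise
    // everything up to the next whitespace.
    public static string QuotedFirstValue(string input)
    {
        return Regex.Match(input, @"file([0-9]+)=(""[^""]+""|[^\s]+)").Groups[2].Value;
    }
}
```

For the input file5="hello world" AND file6=1, NaiveValue stops at the space and returns "hello (with only the opening quote), while QuotedFirstValue returns the full "hello world" including both quotes.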

"[^"]+" doesn't handle escaped characters within your strings. A better matcher is "(\\"|[^"])+" which says quote, then either escaped quote or non-quote repeatedly, then quote. Using this would add a new capture group to your expression. we don't need that, so we can tell it not to capture this group by adding a ?: inside the brackets. eg "(?:\\"|[^"])+"

FIND (files) where file([0-9]+)=("(?:\\"|[^"])+"|[^\s]+)( (AND|OR) file([0-9]+)=("(?:\\"|[^"])+"|[^\s]+))*
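The difference between the two string matchers can be seen on a value that contains escaped quotes. The following is a minimal sketch; QuotedStringDemo and its method names are made up for illustration, and only the string part of the expression is used.

```csharp
using System.Text.RegularExpressions;

public static class QuotedStringDemo
{
    // "[^"]+"  : stops at the first double quote, escaped or not.
    public static string MatchSimple(string input)
    {
        return Regex.Match(input, @"""[^""]+""").Value;
    }

    // "(?:\\"|[^"])+"  : a backslash followed by a quote is consumed as
    // part of the string, so the match runs to the real closing quote.
    public static string MatchEscaped(string input)
    {
        return Regex.Match(input, @"""(?:\\""|[^""])+""").Value;
    }
}
```

Given the text file7="say \"hi\" now", MatchSimple returns only "say \" while MatchEscaped returns the whole quoted value.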

As I mentioned.. in C# you can name capture groups. You do this by adding a ?<name> inside the group.

FIND (?<table>files) where file(?<key>[0-9]+)=(?<value>"(?:\\"|[^"])+"|[^\s]+)( (?<operator>AND|OR) file(?<key>[0-9]+)=(?<value>"(?:\\"|[^"])+"|[^\s]+))*
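In .NET, a name that is used for more than one group (here key and value) collects every capture under that one name, in the order they were matched. A small sketch of reading them back, assuming a hypothetical helper class NamedGroupDemo:

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class NamedGroupDemo
{
    // The named-group expression above, split over two verbatim strings
    // for readability (a literal " is written as "").
    private static readonly Regex Query = new Regex(
        @"FIND (?<table>files) where file(?<key>[0-9]+)=(?<value>""(?:\\""|[^""])+""|[^\s]+)" +
        @"( (?<operator>AND|OR) file(?<key>[0-9]+)=(?<value>""(?:\\""|[^""])+""|[^\s]+))*");

    // Returns every key captured under the name "key", in match order.
    public static List<string> Keys(string input)
    {
        var keys = new List<string>();
        Match m = Query.Match(input);
        if (!m.Success) return keys;

        foreach (Capture c in m.Groups["key"].Captures)
            keys.Add(c.Value);
        return keys;
    }
}
```

For the example query this yields the keys 2, 32 and 623134; m.Groups["value"].Captures and m.Groups["operator"].Captures can be read the same way.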

There is still duplication in this expression.. but if we took it out, we would be allowing invalid expressions to match. eg.

FIND (?<table>files)( (?<operator>AND|OR|where) file(?<key>[0-9]+)=(?<value>"(?:\\"|[^"])+"|[^\s]+))+

This would allow FIND files AND file2="test" to match, which isn't really what you want, but may be good enough.

I would probably just use string concatenation to remove the duplication:

var pair = @"(?<pair>file(?<key>[0-9]+)=(?<value>""(?:\\""|[^""])+""|[^\s]+))";
var query = @"FIND (?<table>files) where "+pair+"( (?<operator>AND|OR) "+pair+")*";
var ex = new Regex(query);

Or just add a check in code to make sure the first operator is "where":

FIND (files)( (AND|OR|where) file([0-9]+)=("(?:\\"|[^"])+"|[^\s]+))+

var query = @"FIND (?<table>files)(?<condition> (?<operator>AND|OR|where) file(?<key>[0-9]+)=(?<value>""(?:\\""|[^""])+""|[^\s]+))+";
var ex = new Regex(query);
var match = ex.Match(...);
... match.Groups["table"].Value ... 

You can now match a string, loop through the captures of the "condition" group, and read the corresponding captures of the "operator", "key", and "value" groups.

see How do I access named capturing groups in a .NET Regex?
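Putting the pieces together, one possible way to turn a matched query into the SQL text from the question is sketched below. The class QueryTranslator, the TryToSql method, and the ^ and $ anchors are my own assumptions, not part of the answer. The i-th capture of "operator", "key" and "value" all come from the i-th repetition of the "condition" group, so they can be read by index.

```csharp
using System.Text;
using System.Text.RegularExpressions;

public static class QueryTranslator
{
    // The "where as an operator" variant, anchored (an assumption) so the
    // whole input has to match.
    private static readonly Regex Query = new Regex(
        @"^FIND (?<table>files)(?<condition> (?<operator>AND|OR|where) file(?<key>[0-9]+)=(?<value>""(?:\\""|[^""])+""|[^\s]+))+$");

    // Returns false (and an empty string) when the input is not a valid query.
    public static bool TryToSql(string input, out string sql)
    {
        sql = "";
        Match m = Query.Match(input);
        if (!m.Success) return false;

        CaptureCollection ops = m.Groups["operator"].Captures;
        CaptureCollection keys = m.Groups["key"].Captures;
        CaptureCollection values = m.Groups["value"].Captures;

        var sb = new StringBuilder("SELECT * FROM [FILES] WHERE");
        for (int i = 0; i < keys.Count; i++)
        {
            // The code check: "where" must be the first operator and only the first.
            if ((ops[i].Value == "where") != (i == 0)) return false;

            if (i > 0) sb.Append(' ').Append(ops[i].Value);
            sb.Append(" (FILE_ID=").Append(keys[i].Value)
              .Append(" AND FILE_VALUE=").Append(values[i].Value).Append(')');
        }
        sql = sb.ToString();
        return true;
    }
}
```

For the example query this produces the SQL shown in the question, with the quotes around "12" and "file23" kept. The values are copied into the SQL text verbatim here; passing them as query parameters instead would be the safer design in a real application.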

Upvotes: 1

Wiktor Stribiżew

Reputation: 627600

Here is another 2-step approach:

  • Get the key-value pairs with the IDs in them and replace them using backreferences
  • Replace the beginning part (a literal "FIND files where") with another literal "SELECT * FROM [FILES] WHERE".

C# demo:

var str = "FIND files where file2=29 AND file32=\"12\" OR file623134=\"file23\"";
var rx = new Regex(@"\bfile(\d+)=""?(\w+)""?");
var result = rx.Replace(str, "(FILE_ID=$1 AND FILE_VALUE=$2)")
              .Replace("FIND files where", "SELECT * FROM [FILES] WHERE");
Console.WriteLine(result);

Result:

SELECT * FROM [FILES] WHERE (FILE_ID=2 AND FILE_VALUE=29) AND (FILE_ID=32 AND FILE_VALUE=12) OR (FILE_ID=623134 AND FILE_VALUE=file23)

The regex breakdown:

  • \bfile - the literal text file, not preceded by a word character
  • (\d+) - 1 or more digits that are captured into Group 1
  • = - literal =
  • "? - 1 or 0 double quote
  • (\w+) - the second capturing group, consisting of 1 or more word characters (letters, digits or underscores)
  • "? - 1 or 0 double quote
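Note that the regex above drops the double quotes from the values, as the result shows. If the quotes should be kept, as in the SQL from the question, one possible variant (my own assumption, not part of this answer) is to capture a whole quoted string as the value. Because the quoted text is consumed in one piece, something that looks like file2=3 inside quotes is left alone. QuoteAwareReplace is a hypothetical name.

```csharp
using System.Text.RegularExpressions;

public static class QuoteAwareReplace
{
    public static string ToSql(string input)
    {
        // Group 1: the digits after "file".
        // Group 2: either a complete quoted string (quotes included)
        //          or a run of word characters.
        var rx = new Regex(@"\bfile(\d+)=(""[^""]*""|\w+)");
        return rx.Replace(input, "(FILE_ID=$1 AND FILE_VALUE=$2)")
                 .Replace("FIND files where", "SELECT * FROM [FILES] WHERE");
    }
}
```

Called with the str from the demo above, this returns the same statement as before but with FILE_VALUE="12" and FILE_VALUE="file23".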

Upvotes: 2

shas

Reputation: 703

like this

<div id="date">file23="125"</div>

js

// Read the raw text, e.g. file23="125"
var data = $('#date').text();
var arr = data.split('=');
// Left side without digits: the name ("file")
var val1 = arr[0].replace(/[0-9]/g, '');
// Left side without letters: the id ("23")
var val2 = arr[0].replace(/[a-zA-Z]/g, '');
// Right side without quotes and other punctuation: the value ("125")
var val = arr[1].replace(/[&\/\\#,+()$~%.'":*?<>{}]/g, '');
$("#date").html("<span>" + val1 + "</span><br>" + "<span> id=" + val2 + "</span><br>" + "<span> value=" + val + "</span><br>");

output

file
id=23
value=125

Upvotes: 1

baddger964

Reputation: 1237

You can detect your file strings with:

file([0-9]+)=\"([0-9]+)\"

This regex returns 3 strings: the entire match, the first number, and the second number in the string.

I hope that's what you expect.

But I think you are missing one point about regex use:

Place parentheses around multiple tokens to group them together. You can then apply a quantifier to the group. E.g. Set(Value)? matches Set or SetValue.

Parentheses create a capturing group. The above example has one group. After the match, group number one contains nothing if Set was matched. It contains Value if SetValue was matched. How to access the group's contents depends on the software or programming language you're using. Group zero always contains the entire regex match.

From: http://www.regular-expressions.info/quickstart.html

So you have to define a regex for the entire line and create a matching group for each substring you want to extract.
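As an illustration of group zero and the numbered groups in C#, the regex from the start of this answer can be read back like this. It is a sketch; GroupZeroDemo is a hypothetical name. Note that this particular pattern only matches quoted, purely numeric values, so file2=29 and file623134="file23" from the question are not matched by it.

```csharp
using System.Text.RegularExpressions;

public static class GroupZeroDemo
{
    // Returns { entire match, first number, second number },
    // or an empty array when nothing matches.
    public static string[] Parts(string input)
    {
        Match m = Regex.Match(input, @"file([0-9]+)=""([0-9]+)""");
        if (!m.Success) return new string[0];

        return new[]
        {
            m.Groups[0].Value, // group zero: the entire regex match
            m.Groups[1].Value, // group one: the digits after "file"
            m.Groups[2].Value  // group two: the digits between the quotes
        };
    }
}
```

For the input file32="12" this gives the three strings file32="12", 32 and 12.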

Upvotes: 2
