kirill_l

Reputation: 645

Regular expression for removing white spaces but not those inside ""

I have the following input string:

key1 = "test string1" ; key2 = "test string 2"

I need to convert it to the following, without tokenizing:

key1="test string1";key2="test string 2"

Upvotes: 4

Views: 601

Answers (2)

przemoc

Reputation: 3879

Using ERE (extended regular expressions, which are clearer than basic REs in cases like this), assuming quotes are never escaped and with the global flag set (to replace all occurrences), you can do it this way:

s/ *([^ "]*) *("[^"]*")?/\1\2/g

sed:

$ echo 'key1 = "test string1" ; key2 = "test string 2"' | sed -r 's/ *([^ "]*) *("[^"]*")?/\1\2/g'

C# code:

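using System;
using System.Text.RegularExpressions;

// Optional spaces, an unquoted token, optional spaces, then an optional quoted string.
Regex regex = new Regex(" *([^ \"]*) *(\"[^\"]*\")?");
String input = "key1 = \"test string1\" ; key2 = \"test string 2\"";
// Keep only the two captured groups, dropping the spaces around them.
String output = regex.Replace(input, "$1$2");
Console.WriteLine(output);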

Output:

key1="test string1";key2="test string 2"

Escape-aware version

On second thought, omitting an escape-aware version of the regexp could be misleading, so here it is:

s/ *([^ "]*) *("([^\\"]|\\.)*")?/\1\2/g

which in C# looks like:

Regex regex = new Regex(" *([^ \"]*) *(\"(?:[^\\\\\"]|\\\\.)*\")?");
String output = regex.Replace(input, "$1$2");

Please do not go blind from those backslashes!
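If the backslashes are hard to read, the same pattern can probably be written as a C# verbatim string, where `""` stands for one quote and backslashes are passed to the regex engine unchanged. This is only a sketch: the class name `EscapeAwareRegexDemo` and the method name `Strip` are made up for illustration and are not part of the original answer.

```csharp
using System;
using System.Text.RegularExpressions;

static class EscapeAwareRegexDemo
{
    // Same escape-aware pattern as above, as a verbatim string:
    // "" is a single double quote, \\ is the regex for one literal backslash.
    static readonly Regex Pattern =
        new Regex(@" *([^ ""]*) *(""(?:[^\\""]|\\.)*"")?");

    // Hypothetical helper: drops spaces outside quoted strings.
    public static string Strip(string input)
    {
        return Pattern.Replace(input, "$1$2");
    }
}
```

Calling `EscapeAwareRegexDemo.Strip(input)` should behave like the `regex.Replace(input, "$1$2")` call above; only the way the pattern is spelled in source code changes.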

Example

Input:  key1 = "test \\ " " string1" ; key2 = "test \" string 2"
Output: key1="test \\ "" string1";key2="test \" string 2"

Upvotes: 2

stusmith

Reputation: 14113

You'd be far better off NOT using a regular expression.

What you should be doing is parsing the string. The problem you've described is a mini-language, since each point in that string has a state (e.g. "in a quoted string", "in the key part", "assignment").

For example, what happens when you decide you want to escape characters?

key1="this is a \"quoted\" string"

Move along the string character by character, maintaining and changing state as you go. Depending on the state, you can either emit or omit the character you've just read.

As a bonus, you'll get the ability to detect syntax errors.
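A minimal sketch of that character-by-character approach in C#, under a few assumptions: the names `QuoteAwareStripper` and `Strip` are hypothetical, a backslash inside quotes is assumed to escape the next character, and only two pieces of state are tracked (inside quotes, and just after a backslash) rather than the full key/assignment states mentioned above. Unlike the regex, it drops all whitespace outside quotes, not just spaces.

```csharp
using System;
using System.Text;

static class QuoteAwareStripper
{
    // Removes whitespace outside double-quoted strings.
    // Assumption: inside quotes, a backslash escapes the next character.
    public static string Strip(string input)
    {
        var output = new StringBuilder(input.Length);
        bool inQuotes = false; // state: currently inside a quoted string
        bool escaped = false;  // state: previous char was an unescaped backslash

        foreach (char c in input)
        {
            if (inQuotes)
            {
                output.Append(c); // everything inside quotes is emitted
                if (escaped) escaped = false;          // escaped char cannot close the string
                else if (c == '\\') escaped = true;    // next char is escaped
                else if (c == '"') inQuotes = false;   // closing quote
            }
            else if (c == '"')
            {
                inQuotes = true; // opening quote
                output.Append(c);
            }
            else if (!char.IsWhiteSpace(c))
            {
                output.Append(c); // emit non-whitespace, omit whitespace
            }
        }

        // The syntax-error bonus: a quote that never closes is reported.
        if (inQuotes)
            throw new FormatException("Unterminated quoted string.");

        return output.ToString();
    }
}
```

For the question's input, `QuoteAwareStripper.Strip(input)` should return key1="test string1";key2="test string 2", and an input with an unclosed quote throws instead of silently producing output.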

Upvotes: 5
