user1413781

Reputation: 31

Removing whitespace from Strings contained in double quotes bash script

I've been using sed to attempt this. I have a text file that contains many lines of the same form, e.g.

4444 username "some information" "someotherinformation" "even more information"

I need to replace the spaces inside the quotes with underscores, so that it looks like this:

4444 username "some_information" "someotherinformation" "even_more_information"

Currently I have been able to separate out the quoted information:

sed 's/"\([^"]*\)"/_/g' myfile.txt

Advice on how to proceed?

Upvotes: 3

Views: 2404

Answers (4)

potong

Reputation: 58430

This might work for you:

echo '4444 username "some information" "someotherinformation" "even more information"' |
sed 's/"[^"]*"/\n&/g;:a;s/\(\n"[^"]*\) /\1_/g;ta;s/\n//g'
4444 username "some_information" "someotherinformation" "even_more_information"
  • Add a marker (\n) to quoted strings. sed 's/"[^"]*"/\n&/g;
  • Replace all spaces in quoted strings by _. :a;s/\(\n"[^"]*\) /\1_/g;ta
  • Remove markers. s/\n//g
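The three steps above can also be laid out as a commented, multi-line sed script. This is only a sketch: the function name underscore_quoted_marker is made up for illustration, and it assumes GNU sed, where \n in the replacement text inserts a newline.

```shell
#!/usr/bin/env bash
# underscore_quoted_marker is a hypothetical helper name, not part of the answer.
# Assumes GNU sed (\n in the replacement inserts a newline).
underscore_quoted_marker() {
  sed '
    # Step 1: put a newline marker in front of every quoted string.
    s/"[^"]*"/\n&/g
    :a
    # Step 2: replace one space in each marked string, then loop while
    # any substitution was made.
    s/\(\n"[^"]*\) /\1_/g
    ta
    # Step 3: strip the markers.
    s/\n//g
  '
}
```

It reads standard input, so it can be used as `underscore_quoted_marker < myfile.txt`.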

Upvotes: 1

Tim Pote

Reputation: 28029

EDITED

The previous version would add unwanted spaces. This version does exactly what the OP wants.

This is probably the easiest way to get what you want.

awk -F'"' '
  BEGIN {
    OFS="\""
  }
  {
    for (i = 2; i < NF; i += 2) {
      gsub(/[ \t]+/, "_", $i)
    }

    print $0
  }
' file > outputFile
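To see why the loop visits only the even-numbered fields, it may help to print how awk splits a line when the field separator is a double quote. The helper name show_quote_fields below is hypothetical and exists only for this illustration.

```shell
#!/usr/bin/env bash
# show_quote_fields is a hypothetical helper, used only to illustrate the split.
# It prints every field of each input line, numbered, with -F'"' as in the answer.
show_quote_fields() {
  awk -F'"' '{ for (i = 1; i <= NF; i++) printf "%d:[%s]\n", i, $i }'
}
```

For the sample line in the question this yields seven fields: the odd-numbered ones hold the text outside the quotes (the last one is empty because the line ends with a quote), and fields 2, 4 and 6 hold the quoted text, which is what the `i = 2; i < NF; i += 2` loop walks over.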

Upvotes: 3

Dennis Williamson

Reputation: 360105

sed -r ':a; s/^((([^"]*"){2})*[^"]*"[^" ]*) /\1_/;ta'
4444 username "some_information" "someotherinformation" "even_more_information"

or

sed ':a; s/^\(\(\([^"]*"\)\{2\}\)*[^"]*"[^" ]*\) /\1_/;ta'
4444 username "some_information" "someotherinformation" "even_more_information"
  • :a - label "a" for the loop
  • s/// - perform a substitution
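  • ^( - anchor the whole search string at the beginning of the line and open capture group 1
  • (([^"]*"){2})* - match two sets of zero or more non-quotes followed by a quote (zero or more times), so that an even number of quotes has been skipped; these are groups 2 and 3, nested inside group 1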
  • [^"]*" - followed by zero or more non-quotes followed by a quote
  • [^" ]* - followed by zero or more characters that are not spaces or quotes
  • ) - end the anchored sequence and look for a required space to replace
  • \1 - substitute the captured group and an underscore for the matched sequence
  • ta - branch (transfer execution) to label :a if a successful substitution has been done (continue to the next instruction if not - which, in this case is to end processing for this line and read the next, starting a new round of processing)

Because the repeated group is greedy, this finds the first space in the last quoted string that still contains a space and replaces it. Then it replaces the next space in that string, if any, and so on until that quoted string has none left.

Then it moves on to the previous quoted string that contains a space, and so on.

This is what the pattern space looks like at each step through the :a ... ta loop:

4444 username "some information" "someotherinformation" "even_more information"

4444 username "some information" "someotherinformation" "even_more_information"

4444 username "some_information" "someotherinformation" "even_more_information"

After the last substitution, the loop runs one more time, finds nothing left to match, and falls through to the end of the script.
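The intermediate states listed above can be reproduced by printing the pattern space after every successful substitution. This is a sketch only: the function name trace_underscore_loop and the label show are made up for illustration, and it assumes a sed that accepts -E for extended regular expressions (GNU and BSD sed both do).

```shell
#!/usr/bin/env bash
# trace_underscore_loop is a hypothetical helper, not part of the answer.
# It prints the pattern space after each substitution made by the :a loop.
trace_underscore_loop() {
  sed -n -E '
    :a
    s/^((([^"]*"){2})*[^"]*"[^" ]*) /\1_/
    # If a substitution was made, go and print the intermediate result.
    t show
    # Otherwise the line is finished: jump to the end without printing.
    b
    :show
    p
    b a
  '
}
```

Because of -n, only the intermediate states are printed. A line with no spaces inside quotes produces no output at all.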

Upvotes: 6

zwol

Reputation: 140619

I'd actually do this in C, which makes it easier to do a character-by-character state machine than most higher-level languages.

#include <stdio.h>
int main(void)
{
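    int inside_quotes = 0;  /* nonzero between an opening and a closing quote */
    int backslash = 0;      /* nonzero if the previous character was an escaping backslash */
    int c;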
    while ((c = getchar()) != EOF) {
        switch (c) {
        case ' ':
            if (inside_quotes)
                c = '_';
            break;
        case '"':
            if (!backslash)
                inside_quotes = !inside_quotes;
            break;
        case '\\':
            if (!backslash)
                backslash = 2;
            break;
        default:
            break;
        }
        if (backslash > 0) backslash--;
        putchar(c);
    }
    return 0;
}

Not tested or even compiled. Backslash handling, in particular, may very well be buggy.
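For readers who would rather stay in the shell, roughly the same state machine can be sketched in awk. This is an approximation rather than a port: the function name underscore_quoted_fsm is made up, and unlike the C version the quote state is reset at the start of every line.

```shell
#!/usr/bin/env bash
# underscore_quoted_fsm is a hypothetical helper, not part of the answer.
# Walks each line one character at a time, tracking quote and escape state.
underscore_quoted_fsm() {
  awk '
  {
    out = ""; inside = 0; escaped = 0
    n = length($0)
    for (i = 1; i <= n; i++) {
      c = substr($0, i, 1)
      if (escaped) {
        escaped = 0          # previous character was a backslash, so c is literal
      } else if (c == "\\") {
        escaped = 1          # the next character is escaped
      } else if (c == "\"") {
        inside = !inside     # an unescaped quote opens or closes a string
      }
      if (c == " " && inside) c = "_"
      out = out c
    }
    print out
  }'
}
```

As in the C code, a quote preceded by a backslash does not open or close a string, while a space inside quotes is replaced even if it follows a backslash.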

Upvotes: 0
