Animesh Pandey

Reputation: 6018

Apply regex on columns using awk

I have the following input:

 new Field("count").del("query_then_fetch");
 new Field("scan").del("query_then_fetch sorting on `_doc`");
 new Field("compress").del("no replacement, implemented at the codec level");
 new Field("compress_threshold").del("no replacement");
 new Field("filter").del("query");

I run the following script on the command line, where the regex matches the strings that are in double quotes:

awk -F '.del' '{match($1, "\".*\"", a); match($2, "\".*\"", b)}END{print a[0]; print b[0]}'

expecting this kind of output:

"count" "query_then_fetch"
"scan" "query_then_fetch sorting on `_doc`"
"compress" "no replacement, implemented at the codec level"
"compress_threshold" "no replacement"
"filter" "query"

but instead I get this output:

"filter"
"query"

How can I resolve this issue?

Upvotes: 0

Views: 228

Answers (3)

Etan Reisner

Reputation: 80951

Your awk script prints only once, in the END block, after all of the input has been processed.

By that point a and b hold only the matches from the last line, and a[0] and b[0] end up on separate lines because you are using two print statements.

What you want, with your current awk script, is to print a[0] and b[0] with a single printf statement while processing each line.

awk -F '.del' '{match($1, "\".*\"", a); match($2, "\".*\"", b); printf "%s %s\n",a[0], b[0]}' sample.csv

Alternatively, you could use the much simpler awk script below, which splits the input on ( and ) characters.

awk -F '[()]' '{print $2,$4}' sample.csv
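
Note that the three-argument form of match() used in the first command is a gawk extension. For other awks, a minimal sketch of the same per-line idea could use RSTART and RLENGTH instead. The temporary file stands in for sample.csv, and the helper name quoted_pairs is made up for illustration:

```shell
# Recreate part of the sample input from the question in a temporary file.
sample=$(mktemp)
cat > "$sample" <<'EOF'
 new Field("count").del("query_then_fetch");
 new Field("scan").del("query_then_fetch sorting on `_doc`");
 new Field("filter").del("query");
EOF

# quoted_pairs is a hypothetical helper name, not something awk provides.
quoted_pairs() {
  awk -F'[.]del' '{
    out = ""; sep = ""
    # On each side of ".del", find the first "..." run and append it.
    for (i = 1; i <= 2; i++)
      if (match($i, /"[^"]*"/)) {
        out = out sep substr($i, RSTART, RLENGTH)
        sep = " "
      }
    print out
  }' "$1"
}

quoted_pairs "$sample"
```

The pattern "[^"]*" stops at the first closing quote, so it does not depend on the greedy .* happening to end at the right place.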

Upvotes: 1

dawg

Reputation: 103884

Given:

$ echo "$tgt" 
 new Field("count").del("query_then_fetch");
 new Field("scan").del("query_then_fetch sorting on `_doc`");
 new Field("compress").del("no replacement, implemented at the codec level");
 new Field("compress_threshold").del("no replacement");
 new Field("filter").del("query");

You can do:

$ echo "$tgt" | awk  '{split($0, a, "\""); print a[2]"\t"a[4]}'
count   query_then_fetch
scan    query_then_fetch sorting on `_doc`
compress    no replacement, implemented at the codec level
compress_threshold  no replacement
filter  query

Add quotes around the fields as needed.
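
One way to add the quotes back, as a small sketch: the printf format is just one possible choice, the temporary file stands in for the input, and requote is a made-up helper name.

```shell
# Two of the sample lines from the question, written to a temporary file.
sample=$(mktemp)
cat > "$sample" <<'EOF'
 new Field("compress").del("no replacement, implemented at the codec level");
 new Field("compress_threshold").del("no replacement");
EOF

# requote is a hypothetical helper name used only for this example.
requote() {
  # Split each line on double quotes, then print pieces 2 and 4
  # wrapped in double quotes again, separated by a single space.
  awk '{ split($0, a, "\""); printf "\"%s\" \"%s\"\n", a[2], a[4] }' "$1"
}

requote "$sample"
```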

Or, you can do:

$ echo "$tgt" | awk  '{split($0, a, /[()]/); print a[2],a[4]}'
"count" "query_then_fetch"
"scan" "query_then_fetch sorting on `_doc`"
"compress" "no replacement, implemented at the codec level"
"compress_threshold" "no replacement"
"filter" "query"

Upvotes: 1

Haifeng Zhang

Reputation: 31895

cat sample.csv
 new Field("count").del("query_then_fetch");
 new Field("scan").del("query_then_fetch sorting on `_doc`");
 new Field("compress").del("no replacement, implemented at the codec level");
 new Field("compress_threshold").del("no replacement");
 new Field("filter").del("query");

awk -F'"' -v q="\"" '{print q $2 q,q $4 q}' sample.csv
"count" "query_then_fetch"
"scan" "query_then_fetch sorting on `_doc`"
"compress" "no replacement, implemented at the codec level"
"compress_threshold" "no replacement"
"filter" "query"

I am using the double quote as the field separator and printing the 2nd and 4th fields, with the quotes added back through the variable q.
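
If the input might also contain lines without two quoted strings, a guarded variant could skip them. This is only a sketch: the NF >= 5 condition, the extra comment line in the sample, and the helper name quoted_only are assumptions for illustration, not part of the original data.

```shell
# Sample input with one hypothetical line that has no quoted strings.
sample=$(mktemp)
cat > "$sample" <<'EOF'
 new Field("count").del("query_then_fetch");
 // a comment line with no quoted strings (hypothetical)
 new Field("filter").del("query");
EOF

# quoted_only is a made-up helper name for this example.
quoted_only() {
  # With the double quote as separator, two complete quoted strings
  # yield at least 5 fields; lines with fewer fields are skipped.
  awk -F'"' -v q='"' 'NF >= 5 { print q $2 q, q $4 q }' "$1"
}

quoted_only "$sample"
```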

Upvotes: 1
