owwoow14

Reputation: 1754

awk/grep certain parts of a specific column

I have a problem that I am at a loss to solve. I have three-column, tab-separated data, such as:

abs nmod+n+n-commitment-n   349.200023
abs nmod+n+n-a-commitment-n 333.306429
abs into+ns-j+vn-pass-rb-divide-v   295.57316
abs nmod+n+ns-commitment-n  182.085018
abs nmod+n+n-pledge-n   149.927391
abs nmod+n+ns-reagent-n 142.347358

I need to isolate the last two "elements" of the second column (the hyphen-separated pieces). The desired result is a 4-column output that contains only the lines whose second column ends with "-n".

For example:

abs nmod+n+n   commitment-n   349.200023
abs nmod+n+n-a   commitment-n 333.306429
abs nmod+n+ns   commitment-n  182.085018
abs nmod+n+n   pledge-n   149.927391
abs nmod+n+ns   reagent-n 142.347358

Is there an awk or grep command, or anything else, that can help with this? The files are approx. 500 MB, so they are not huge, but not small either. Thanks for any insight.

Upvotes: 2

Views: 1969

Answers (3)

devnull

Reputation: 123458

Using sed:

sed -r -n '/-n\t[0-9.]*$/{s/(\S+)\t(.*)-([^-]+-\S+)\t(.*)/\1\t\2\t\3\t\4/p}' filename

For your input, it'd produce:

abs nmod+n+n    commitment-n    349.200023
abs nmod+n+n-a  commitment-n    333.306429
abs nmod+n+ns   commitment-n    182.085018
abs nmod+n+n    pledge-n    149.927391
abs nmod+n+ns   reagent-n   142.347358
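
The command above relies on GNU sed (-r, and \t and \S inside the regex). As a rough sketch for a sed that only offers -E, the same substitution can be written with a literal tab held in a shell variable. The function name split_sed is made up for illustration, and this variant is an assumption rather than part of the original answer:

```shell
# split_sed is a hypothetical helper name; it reads tab-separated lines on stdin.
split_sed() {
    tab=$(printf '\t')   # literal tab, because \t inside a regex is a GNU extension
    # Address: only lines whose second field ends in -n, followed by the number.
    # Substitution: split off the last two hyphen-separated pieces into a new field.
    sed -n -E "/-n${tab}[0-9.]*\$/{
        s/^([^${tab}]+)${tab}(.*)-([^-]+-[^${tab}]+)${tab}(.*)\$/\1${tab}\2${tab}\3${tab}\4/p
    }"
}

# Example: the first line is split, the second (ending in -v) is dropped.
printf 'abs\tnmod+n+n-a-commitment-n\t333.306429\nabs\tinto+ns-j+vn-pass-rb-divide-v\t295.57316\n' | split_sed
```

Call it as split_sed < filename to process a whole file.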

Upvotes: 1

Kent

Reputation: 195039

Give this one-liner a try (it needs gawk, because gensub() is a gawk extension):

awk -F'\t' -v OFS='\t' '$2~/-n$/{$2=gensub(/-([^-]*-n$)/,"\t\\1","g",$2);print}' file

Output with your file (saved as f):

kent$  awk -F'\t' -v OFS='\t' '$2~/-n$/{$2=gensub(/-([^-]*-n$)/,"\t\\1","g",$2);print}' f
abs     nmod+n+n        commitment-n    349.200023
abs     nmod+n+n-a      commitment-n    333.306429
abs     nmod+n+ns       commitment-n    182.085018
abs     nmod+n+n        pledge-n        149.927391
abs     nmod+n+ns       reagent-n       142.347358
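
If gawk is not available, a similar result can probably be had with the POSIX match() and substr() functions instead of gensub(). This is only a sketch; the function name split_awk and the variables head and tail are invented for the example:

```shell
# split_awk is a hypothetical helper name; it reads tab-separated lines on stdin.
split_awk() {
    awk -F'\t' -v OFS='\t' '
        # match() succeeds only if field 2 ends in "-<piece>-n";
        # RSTART is then the position of the hyphen before <piece>.
        match($2, /-[^-]*-n$/) {
            head = substr($2, 1, RSTART - 1)   # everything before that hyphen
            tail = substr($2, RSTART + 1)      # "<piece>-n" without the leading hyphen
            print $1, head, tail, $3
        }'
}

# Example: the first line is split, the second (ending in -v) is dropped.
printf 'abs\tnmod+n+n-a-commitment-n\t333.306429\nabs\tinto+ns-j+vn-pass-rb-divide-v\t295.57316\n' | split_awk
```

One difference from the gensub() version: a second field with only one hyphen (such as foo-n) does not match here and is skipped, so every printed line has four columns.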

Upvotes: 3

fedorqui

Reputation: 289515

With this you can check if the 2nd column ends with -n and then print the lines:

$ awk '$2~/-n$/' file
abs nmod+n+n-commitment-n   349.200023
abs nmod+n+n-a-commitment-n 333.306429
abs nmod+n+ns-commitment-n  182.085018
abs nmod+n+n-pledge-n   149.927391
abs nmod+n+ns-reagent-n 142.347358
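
Since the question also mentions grep, the same filtering step (without the splitting) could be sketched with grep, assuming the "-n" always sits right before the tab that starts the last column. The function name filter_n is made up for the example:

```shell
# filter_n is a hypothetical helper name; it reads tab-separated lines on stdin.
filter_n() {
    tab=$(printf '\t')   # literal tab for use inside the pattern
    # -e keeps grep from reading the leading "-n" of the pattern as an option.
    # The pattern requires "-n", a tab, and then a final field with no further tabs.
    grep -e "-n${tab}[^${tab}]*\$"
}

# Example: only the first line is kept.
printf 'abs\tnmod+n+n-pledge-n\t149.927391\nabs\tinto+ns-j+vn-pass-rb-divide-v\t295.57316\n' | filter_n
```

Call it as filter_n < file to filter a whole file.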

To split the second field so that the last two elements are isolated, you can use:

awk 'BEGIN{OFS=FS="\t"}
     $2~/-n$/ {
               size=split($2,a,"-");
               # join all pieces except the last two; add "-" only between pieces
               for (i=1; i<=size-2; i++) first=(i>1 ? first"-" : "")a[i];
               second=a[size-1]"-"a[size];
               print $1,first,second,$3;
               first=second=""
              }' file

which returns

$ awk 'BEGIN{OFS=FS="\t"} $2~/-n$/ {size=split($2,a,"-"); for (i=1; i<=size-2; i++) first=(i>1 ? first"-" : "")a[i]; second=a[size-1]"-"a[size]; print $1,first,second,$3; first=second=""}' file
abs     nmod+n+n        commitment-n    349.200023
abs     nmod+n+n-a      commitment-n    333.306429
abs     nmod+n+ns       commitment-n    182.085018
abs     nmod+n+n        pledge-n        149.927391
abs     nmod+n+ns       reagent-n       142.347358

Explanation

  • BEGIN{OFS=FS="\t"} sets tab as the input and output field separator.
  • $2~/-n$/ {} matches lines in which the 2nd field ends with "-n" and runs the code within {}.
  • size=split($2,a,"-") cuts the 2nd field into pieces at each - delimiter and saves them in the a[] array. The number of pieces is stored in the size variable.
  • for (i=1; i<=size-2; i++) first=(i>1 ? first"-" : "")a[i]; second=a[size-1]"-"a[size] builds two blocks: first holds every piece except the last two, joined with "-" (the i>1 test avoids a leading "-"); second holds the last two pieces.
  • print $1,first,second,$3 prints the four output columns.
  • first=second="" resets the variables for the next line.

Upvotes: 3
