Reputation: 1754
I have a problem that I am at a loss to solve. I have 3-column tab-separated data, such as:
abs nmod+n+n-commitment-n 349.200023
abs nmod+n+n-a-commitment-n 333.306429
abs into+ns-j+vn-pass-rb-divide-v 295.57316
abs nmod+n+ns-commitment-n 182.085018
abs nmod+n+n-pledge-n 149.927391
abs nmod+n+ns-reagent-n 142.347358
I need to isolate the last two "elements" (hyphen-separated pieces) of the second column. My desired result is a 4-column output that only contains the rows whose second column ends with "-n", such as:
abs nmod+n+n commitment-n 349.200023
abs nmod+n+n-a commitment-n 333.306429
abs nmod+n+ns commitment-n 182.085018
abs nmod+n+n pledge-n 149.927391
abs nmod+n+ns reagent-n 142.347358
In this case, is there an awk, grep, or anything else that can help? The files are approx. 500 MB, so they are not huge, but not small either.
Thanks for any insight.
Upvotes: 2
Views: 1969
Reputation: 123458
Using sed:
sed -r -n '/-n\t[0-9.]*$/{s/(\S+)\t(.*)-([^-]+-\S+)\t(.*)/\1\t\2\t\3\t\4/p}' filename
For your input, it'd produce:
abs nmod+n+n commitment-n 349.200023
abs nmod+n+n-a commitment-n 333.306429
abs nmod+n+ns commitment-n 182.085018
abs nmod+n+n pledge-n 149.927391
abs nmod+n+ns reagent-n 142.347358
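For readers who want to see how the expression is put together, here is the same command wrapped in a shell function with each part annotated. This is only a sketch: the function name split_with_sed is made up for illustration, the sample rows come from the question, and GNU sed is assumed (for -r, \t and \S).

```shell
# Hypothetical wrapper around the sed command above (GNU sed assumed).
#   /-n\t[0-9.]*$/   act only on lines whose 2nd field ends in "-n"
#   (\S+)\t          group 1: the first column
#   (.*)-            group 2: everything before the last two pieces
#   ([^-]+-\S+)\t    group 3: the last two pieces, e.g. commitment-n
#   (.*)             group 4: the score
split_with_sed() {
  sed -r -n '/-n\t[0-9.]*$/{s/(\S+)\t(.*)-([^-]+-\S+)\t(.*)/\1\t\2\t\3\t\4/p}'
}

# Two sample rows from the question; only the "-n" row should be printed.
printf 'abs\tnmod+n+n-a-commitment-n\t333.306429\nabs\tinto+ns-j+vn-pass-rb-divide-v\t295.57316\n' | split_with_sed
```

Because (.*) is greedy, group 2 takes as much as it can, which is what leaves exactly the last two pieces for group 3.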
Upvotes: 1
Reputation: 195039
Give this one-liner a try (it needs gawk, because gensub is a gawk extension):
awk -F'\t' -v OFS='\t' '$2~/-n$/{$2=gensub(/-([^-]*-n$)/,"\t\\1","g",$2);print}' file
Output with your file (as f):
kent$ awk -F'\t' -v OFS='\t' '$2~/-n$/{$2=gensub(/-([^-]*-n$)/,"\t\\1","g",$2);print}' f
abs nmod+n+n commitment-n 349.200023
abs nmod+n+n-a commitment-n 333.306429
abs nmod+n+ns commitment-n 182.085018
abs nmod+n+n pledge-n 149.927391
abs nmod+n+ns reagent-n 142.347358
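If gawk is not available, roughly the same result can probably be had with plain awk by using match() and substr() instead of gensub. The following is a sketch of that alternative, not the answer's own method; the function name split_last_two is invented for illustration.

```shell
# Hypothetical portable variant: no gensub, only match() and substr().
split_last_two() {
  awk -F'\t' -v OFS='\t' '
    # match() sets RSTART to the hyphen that starts the final "-piece-n"
    # and returns 0 (so the row is skipped) when there is no such suffix.
    match($2, /-[^-]*-n$/) {
      head = substr($2, 1, RSTART - 1)   # everything before that hyphen
      tail = substr($2, RSTART + 1)      # the last two pieces, e.g. pledge-n
      print $1, head, tail, $3
    }'
}

# Sample rows from the question; the row ending in "-v" is dropped.
printf 'abs\tnmod+n+n-pledge-n\t149.927391\nabs\tinto+ns-j+vn-pass-rb-divide-v\t295.57316\n' | split_last_two
```

One behavioural difference to be aware of: a second column with only one hyphen (for example x-n) is skipped here, whereas the gensub one-liner would print it unchanged with three columns.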
Upvotes: 3
Reputation: 289515
With this you can check whether the 2nd column ends with -n and then print the matching lines:
$ awk '$2~/-n$/' file
abs nmod+n+n-commitment-n 349.200023
abs nmod+n+n-a-commitment-n 333.306429
abs nmod+n+ns-commitment-n 182.085018
abs nmod+n+n-pledge-n 149.927391
abs nmod+n+ns-reagent-n 142.347358
To split the second field so that the last two elements are isolated, you can use:
awk 'BEGIN{OFS=FS="\t"}
$2~/-n$/ {
  size=split($2,a,"-");
  for (i=1; i<=size-2; i++) first=first"-"a[i];
  sub(/^-/,"",first);  # drop the hyphen the loop puts in front of a[1]
  second=a[size-1]"-"a[size];
  print $1,first,second,$3;
  first=second=""
}' file
which returns
$ awk 'BEGIN{OFS=FS="\t"} $2~/-n$/ {size=split($2,a,"-"); for (i=1; i<=size-2; i++) first=first"-"a[i]; sub(/^-/,"",first); second=a[size-1]"-"a[size]; print $1,first,second,$3; first=second=""}' file
abs nmod+n+n commitment-n 349.200023
abs nmod+n+n-a commitment-n 333.306429
abs nmod+n+ns commitment-n 182.085018
abs nmod+n+n pledge-n 149.927391
abs nmod+n+ns reagent-n 142.347358
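To make the role of split() concrete, here is a small sketch that prints the numbered pieces of one second-column value from the question. The function name show_pieces is made up for illustration.

```shell
# Hypothetical helper: list the pieces split() produces for each input line.
show_pieces() {
  awk '{
    size = split($0, a, "-")                   # a[1]..a[size] hold the pieces
    for (i = 1; i <= size; i++) print i, a[i]  # index, then the piece
  }'
}

printf 'nmod+n+n-a-commitment-n\n' | show_pieces
# prints:
# 1 nmod+n+n
# 2 a
# 3 commitment
# 4 n
```

With size equal to 4, pieces 1 and 2 go into the first block and pieces 3 and 4 (a[size-1] and a[size]) form the second.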
Explanation:
BEGIN{OFS=FS="\t"}: set tab as the input and output field separator.
$2~/-n$/ {}: match lines in which the 2nd field ends with "-n" and do the things within {}.
size=split($2,a,"-"): cut the 2nd field into pieces on the - delimiter and save them in the a[] array. Store the number of pieces in the size variable.
for (i=1; i<=size-2; i++) first=first"-"a[i]; second=a[size-1]"-"a[size]: save the data in two blocks: first, everything before the last two pieces; then, the last two pieces.
print $1,first,second,$3: print everything.
first=second="": reset the variables for the next line.
Upvotes: 3