Reputation: 125
I have a file containing several lines with alphanumeric strings like ZINC123345667_123, interspersed with other lines. I need to remove the delimiter "_" and the digits after it, but only on lines containing "ZINC"; all other lines should remain unchanged. I tried the awk command below, but it outputs only the lines containing "ZINC" and drops the others.
My original Data:
Name: ZINC00000036_1
Grid Score: -23.170839
Grid_vdw: -22.304409
Grid_es: -0.866430
Int_energy: 4.932559
@<TRIPOS>MOLECULE
ZINC00000036_1
18 18 1 0 0
Name: ZINC00000053_3
Grid Score: -23.739523
Grid_vdw: -22.876204
Grid_es: -0.863320
Int_energy: 9.981080
@<TRIPOS>MOLECULE
ZINC00000053_3
20 20 1 0 0
Name: ZINC00000351_12
Grid Score: -30.763229
Grid_vdw: -27.735493
Grid_es: -3.027738
Int_energy: 4.097543
@<TRIPOS>MOLECULE
ZINC00000351_12
31 31 1 0 0
I ran the following awk command:
awk -F'_' '/ZINC/ {print $1}' data.file > out.file
Output obtained:
Name: ZINC00000036
ZINC00000036
Name: ZINC00000053
ZINC00000053
Name: ZINC00000351
ZINC00000351
But I need the other lines in the output file too, as below:
Name: ZINC00000036
Grid Score: -23.170839
Grid_vdw: -22.304409
Grid_es: -0.866430
Int_energy: 4.932559
@<TRIPOS>MOLECULE
ZINC00000036
18 18 1 0 0
Name: ZINC00000053
Grid Score: -23.739523
Grid_vdw: -22.876204
Grid_es: -0.863320
Int_energy: 9.981080
@<TRIPOS>MOLECULE
ZINC00000053
20 20 1 0 0
Name: ZINC00000351
Grid Score: -30.763229
Grid_vdw: -27.735493
Grid_es: -3.027738
Int_energy: 4.097543
@<TRIPOS>MOLECULE
ZINC00000351
31 31 1 0 0
As my data file is huge and transforming it manually would be impossible, I would greatly appreciate any help with awk.
Upvotes: 4
Views: 4907
Reputation: 264
Another answer, using sed:
sed 's/\(ZINC[0-9]*\)\(_.*\)/\1/g' inputfile
This replaces the whole match with its first capture group (ZINC and its digits), dropping the underscore and everything after it on the line. All other lines are printed unchanged.
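Because the second group is `_.*`, this pattern removes everything from the underscore to the end of the line, not just the digits. The sketch below illustrates the difference against a stricter pattern; the sample line with trailing text (`docked`) is a made-up assumption and does not appear in the question's data.

```shell
# Hypothetical line with text after the ID; not taken from the question's data.
line='ZINC00000036_1 docked'

# Pattern from this answer: drops the underscore and the rest of the line.
loose=$(printf '%s\n' "$line" | sed 's/\(ZINC[0-9]*\)\(_.*\)/\1/g')

# Stricter variant (an assumption, not part of the answer): drops only "_digits".
strict=$(printf '%s\n' "$line" | sed 's/\(ZINC[0-9][0-9]*\)_[0-9][0-9]*/\1/')

printf '%s\n' "$loose" "$strict"
```

For the data shown in the question, where nothing follows the digits, both patterns should give the same result.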
Upvotes: 0
Reputation: 74605
I don't think that awk is the right tool for this job. A simple sed command will do it:
sed 's/\(ZINC[0-9]\{1,\}\)_[0-9]\{1,\}/\1/' file # most portable
sed 's/\(ZINC[0-9]\+\)_[0-9]\+/\1/' file # GNU sed
sed -E 's/(ZINC[0-9]+)_[0-9]+/\1/' file # extended regex mode
Capture the part before the underscore (ZINC, followed by some digits) and discard the rest.
Same thing in Perl, which is marginally shorter due to the digit character class \d:
perl -pe 's/(ZINC\d+)_\d+/$1/' file
Come to think of it, if you're determined to use awk, this would work:
awk -F_ '/ZINC/{$0=$1}1' file
When ZINC is matched, overwrite the line with the contents of the first field. The 1 at the end ensures that every line is printed.
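As a quick check, here is a minimal sketch that runs this awk command on a few lines copied from the question. It also shows a variant using `sub()`, which is my own assumption rather than part of the answer above: it strips only a trailing `_digits` suffix instead of cutting at the first underscore. The variable names are hypothetical.

```shell
# Sample lines copied from the question; Grid_vdw has an underscore but no "ZINC".
input='Name: ZINC00000036_1
Grid_vdw: -22.304409
ZINC00000036_1
18 18 1 0 0'

# -F_ splits on the underscore; on ZINC lines, $0=$1 keeps only the first field.
# The trailing 1 is an always-true pattern whose default action prints the line.
field_result=$(printf '%s\n' "$input" | awk -F_ '/ZINC/{$0=$1}1')

# Assumed alternative: remove only a trailing "_digits" suffix on ZINC lines.
sub_result=$(printf '%s\n' "$input" | awk '/ZINC/{sub(/_[0-9]+$/, "")}1')

printf '%s\n' "$field_result"
```

On this sample the two commands agree; they would differ only if a ZINC line contained an underscore somewhere other than before the trailing digits.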
Upvotes: 1
Reputation: 207455
I would tackle this with sed:
sed -E '/ZINC[0-9]+_/s/_.*//' yourfile
That says: on any line that contains "ZINC" followed by some digits then an underscore, substitute (i.e. replace) the underscore and anything else on the line with nothing, in yourfile.
If you add -i after the sed command, it allows in-place editing without your needing to create a second file.
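A cautious way to try in-place editing is to keep a backup. The sketch below assumes a throwaway directory and a tiny sample file built from lines in the question; the `.bak` suffix is an arbitrary choice. Attaching the suffix directly to `-i` is a form that both GNU sed and BSD/macOS sed accept, whereas a bare `-i` with no suffix is GNU-specific.

```shell
# Work in a throwaway directory so no real data is modified.
workdir=$(mktemp -d)
printf '%s\n' 'Name: ZINC00000053_3' 'Grid_es: -0.863320' 'ZINC00000053_3' \
  > "$workdir/data.file"

# -i.bak edits data.file in place and saves the original as data.file.bak.
sed -E -i.bak '/ZINC[0-9]+_/s/_.*//' "$workdir/data.file"

# Show the edited file; the untouched original remains in data.file.bak.
cat "$workdir/data.file"
```

Once the result looks right, the `.bak` file can be deleted; for a huge file, note that the backup temporarily doubles the disk space used.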
Upvotes: 1
Reputation: 33327
To keep only the part before the first underscore character _ on lines containing ZINC, and leave the other lines intact, you can do:
awk -F'_' '/ZINC/{print $1;next}1' file
Upvotes: 0