Reputation: 125

Awk script to match a pattern and then remove entire line after delimiter

I have a file containing several lines with alphanumeric strings like ZINC123345667_123, interspersed with other lines. I need to remove the digits after the delimiter "_" only on lines containing "ZINC", leaving all other lines unchanged. I tried the awk command below, but the output contained only the "ZINC" lines and none of the others.

My original Data:

 Name:      ZINC00000036_1
 Grid Score:          -23.170839
 Grid_vdw:          -22.304409
 Grid_es:           -0.866430
 Int_energy:            4.932559

@<TRIPOS>MOLECULE
ZINC00000036_1
 18 18 1 0 0

Name:       ZINC00000053_3
 Grid Score:          -23.739523
 Grid_vdw:          -22.876204
 Grid_es:           -0.863320
 Int_energy:            9.981080

@<TRIPOS>MOLECULE
ZINC00000053_3
 20 20 1 0 0

 Name:      ZINC00000351_12
 Grid Score:          -30.763229
 Grid_vdw:          -27.735493
 Grid_es:           -3.027738
 Int_energy:            4.097543

@<TRIPOS>MOLECULE
ZINC00000351_12
 31 31 1 0 0

I ran the following awk command:

awk -F'_' '/ZINC/ {print $1}' data.file > out.file

Output obtained:

Name:       ZINC00000036
ZINC00000036
Name:       ZINC00000053
ZINC00000053
Name:       ZINC00000351
ZINC00000351

But I need the other lines in the output file too, as below:

 Name:      ZINC00000036
 Grid Score:          -23.170839
 Grid_vdw:          -22.304409
 Grid_es:           -0.866430
 Int_energy:            4.932559

@<TRIPOS>MOLECULE
ZINC00000036
 18 18 1 0 0

 Name:      ZINC00000053
 Grid Score:          -23.739523
 Grid_vdw:          -22.876204
 Grid_es:           -0.863320
 Int_energy:            9.981080

@<TRIPOS>MOLECULE
ZINC00000053
 20 20 1 0 0

 Name:      ZINC00000351
 Grid Score:          -30.763229
 Grid_vdw:          -27.735493
 Grid_es:           -3.027738
 Int_energy:            4.097543

@<TRIPOS>MOLECULE
ZINC00000351
 31 31 1 0 0

As my data file is huge, editing it by hand would be impossible, so I would greatly appreciate any help with awk.

Upvotes: 4

Answers (5)

Sadhun

Reputation: 264

Another approach, using sed:

sed 's/\(ZINC[0-9]*\)\(_.*\)/\1/g' inputfile

This replaces the whole match with the first captured group (ZINC followed by digits), dropping the underscore and everything after it. All other lines are printed unchanged.

Upvotes: 0

Ed Morton

Reputation: 203463

sed '/ZINC/s/_.*//' file
awk '/ZINC/{sub(/_.*/,"")}1' file

Upvotes: 2

Tom Fenech

Reputation: 74605

I don't think that awk is the right tool for this job. A simple sed command will do it:

sed 's/\(ZINC[0-9]\{1,\}\)_[0-9]\{1,\}/\1/' file  # most portable
sed 's/\(ZINC[0-9]\+\)_[0-9]\+/\1/' file          # GNU sed
sed -E 's/(ZINC[0-9]+)_[0-9]+/\1/' file           # extended regex mode

Capture the part before the underscore (ZINC, followed by some digits) and discard the rest.
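To illustrate what the capture group buys you, here is a small sketch comparing the anchored substitution with a plain s/_.*// restricted to ZINC lines. The input line is hypothetical (it does not occur in the question's data); it is only there to show how the two behave when an underscore appears before the ZINC id and text follows it.

```shell
# Hypothetical line: an underscore before the ZINC id and extra text after it.
line='Lig_name: ZINC00000036_1 rank 2'

# Anchored substitution: removes only the _digits suffix attached to the ZINC id.
anchored=$(printf '%s\n' "$line" | sed 's/\(ZINC[0-9]\{1,\}\)_[0-9]\{1,\}/\1/')

# Unanchored substitution: removes everything from the first underscore onward.
greedy=$(printf '%s\n' "$line" | sed '/ZINC/s/_.*//')

printf '%s\n%s\n' "$anchored" "$greedy"
```

The first result keeps "Lig_name:" and "rank 2"; the second is cut down to "Lig". For the data shown in the question both forms give the same output, so the difference only matters if your real file has lines like the hypothetical one.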

Same thing in Perl, which is marginally shorter due to the digit character class \d:

perl -pe 's/(ZINC\d+)_\d+/$1/' file

Come to think of it, if you're determined to use awk, this would work:

awk -F_ '/ZINC/{$0=$1}1' file

When ZINC is matched, overwrite the line with the contents of the first field. The 1 at the end ensures that every line is printed.
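As a quick check, the sketch below runs both the command from the question and the one above on a few lines abbreviated from the question's data. The temporary file and the variable names are just for illustration.

```shell
# Sample file, abbreviated from the data in the question.
sample=$(mktemp)
printf '%s\n' \
  ' Name:      ZINC00000036_1' \
  ' Grid_vdw:          -22.304409' \
  '@<TRIPOS>MOLECULE' \
  'ZINC00000036_1' > "$sample"

# Command from the question: only lines matching ZINC are ever printed.
only_zinc=$(awk -F_ '/ZINC/ {print $1}' "$sample")

# Overwrite $0 on ZINC lines, then let the trailing 1 print every line.
all_lines=$(awk -F_ '/ZINC/{$0=$1}1' "$sample")

printf '%s\n' "$all_lines"
rm -f "$sample"
```

Note that the Grid_vdw line contains an underscore too, but it is left alone because it does not match /ZINC/.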

Upvotes: 1

Mark Setchell

Reputation: 207455

I would tackle this with sed:

sed -E '/ZINC[0-9]+_/s/_.*//' yourfile

That says: on any line in yourfile that contains "ZINC" followed by some digits and then an underscore, substitute (i.e. replace) the underscore and everything after it on the line with nothing.

If you add -i after sed, the file is edited in place, so you don't need to create a second file.
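One caveat: -i is not part of POSIX sed, and a bare -i behaves differently in GNU sed and BSD/macOS sed (the BSD version expects a backup suffix argument). A form that should work with both is to attach a suffix directly, which also keeps a copy of the original. The sketch below uses a throwaway directory and sample content taken from the question; the .bak suffix is an arbitrary choice.

```shell
# Throwaway working directory with a small sample file.
workdir=$(mktemp -d)
printf '%s\n' \
  ' Name:      ZINC00000053_3' \
  ' Grid_es:           -0.863320' \
  'ZINC00000053_3' > "$workdir/yourfile"

# -i.bak edits yourfile in place and keeps the original as yourfile.bak.
sed -i.bak -E '/ZINC[0-9]+_/s/_.*//' "$workdir/yourfile"

edited=$(cat "$workdir/yourfile")
backup=$(cat "$workdir/yourfile.bak")
printf '%s\n' "$edited"
rm -rf "$workdir"
```

With a huge data file it is probably worth keeping the backup until you have checked the result.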

Upvotes: 1

user000001

Reputation: 33327

To keep only the part before the first underscore character _ on lines containing ZINC, and leave the other lines intact, you can do:

awk -F'_' '/ZINC/{print $1;next}1' file
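The next matters here: it skips the trailing 1 for lines that were already printed. A small sketch of the difference, using two lines from the question's data in a temporary file (the variable names are only for illustration):

```shell
# Two lines taken from the question's data.
demo=$(mktemp)
printf '%s\n' 'ZINC00000351_12' ' 31 31 1 0 0' > "$demo"

# With next: a ZINC line is printed once (trimmed); other lines fall through to 1.
with_next=$(awk -F'_' '/ZINC/{print $1;next}1' "$demo")

# Without next: a ZINC line is printed trimmed and then again in full by 1.
without_next=$(awk -F'_' '/ZINC/{print $1}1' "$demo")

printf '%s\n---\n%s\n' "$with_next" "$without_next"
rm -f "$demo"
```

Without next, the ZINC line shows up twice: once trimmed and once with its original _12 suffix.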

Upvotes: 0
