user1606106
user1606106

Reputation: 207

Deleting lines with sed or awk

I have a file data.txt like this.

>1BN5.txt
207
208
211
>1B24.txt
88
92

I have a folder F1 that contains text files.

1BN5.txt file in F1 folder is shown below.

ATOM    421  CA  SER A 207      68.627 -29.819   8.533  1.00 50.79           C 
ATOM    421  CA  SER A 207      68.627 -29.819   8.533  1.00 50.79           C  
ATOM    422  C   SER A 248      70.124 -29.955   8.226  1.00 55.81           C 
ATOM    615  H   LEU B 208       3.361  -5.394  -6.021  1.00 10.00           H
ATOM    616  HA  LEU B 211       2.930  -4.494  -3.302  1.00 10.00           H 
ATOM    626  N   MET B  87       1.054  -3.071  -5.633  1.00 10.00           N  
ATOM    627  CA  MET B  87      -0.213  -2.354  -5.826  1.00 10.00           C 

1B24.txt file in F1 folder is shown below.

ATOM    630  CB  MET B  87      -0.476  -2.140  -7.318  1.00 10.00           C 
ATOM    631  CG  MET B  88      -0.828  -0.688  -7.575  1.00 10.00           C
ATOM    632  SD  MET B  88      -2.380  -0.156  -6.830  1.00 10.00           S
ATOM    643  N   ALA B  92      -1.541  -4.371  -5.366  1.00 10.00           N  
ATOM    644  CA  ALA B  94      -2.560  -5.149  -4.675  1.00 10.00           C

I need only the lines containing 207,208,211(6th column)in 1BN5.txt file. I want to delete other lines in 1BN5.txt file. Like this, I need only the lines containing 88,92 in 1B24.txt file.

Desired output

1BN5.txt file

ATOM    421  CA  SER A 207      68.627 -29.819   8.533  1.00 50.79           C
ATOM    421  CA  SER A 207      68.627 -29.819   8.533  1.00 50.79           C 
ATOM    615  H   LEU B 208       3.361  -5.394  -6.021  1.00 10.00           H  
ATOM    616  HA  LEU B 211       2.930  -4.494  -3.302  1.00 10.00           H

1B24.txt file

ATOM    631  CG  MET B  88      -0.828  -0.688  -7.575  1.00 10.00           C
ATOM    632  SD  MET B  88      -2.380  -0.156  -6.830  1.00 10.00           S
ATOM    643  N   ALA B  92      -1.541  -4.371  -5.366  1.00 10.00           N 

Upvotes: 4

Views: 365

Answers (7)

glenn jackman
glenn jackman

Reputation: 247230

This solution plays some tricks with the record separator: "data.txt" uses > as the record separator, while the other files use newline.

awk '
    BEGIN {RS=">"}
    FNR == 1 {
        # since the first char in data.txt is the record separator, 
        # there is an empty record before the real data starts
        next
    }
    {
        n = split($0, a, "\n")
        file = "F1/" a[1]
        newfile = file ".new"
        RS="\n"
        while (getline < file) {
            for (i=2; i<n; i++) {
                if ($6 == a[i]) {
                    print > newfile
                    break
                }
            }
        }
        RS=">"
        system(sprintf("mv \"%s\" \"%s.bak\" && mv \"%s\" \"%s\"", file, file, newfile, file))
    }
' data.txt 

Upvotes: 1

Ed Morton
Ed Morton

Reputation: 204719

This will move all of the files in F1 to a tmp dir named "backup" and then re-create just the resultant non-empty files under F1

mv F1 backup &&
mkdir F1 &&
awk '
NF==FNR {
   if (sub(/>/,"")) {
      file=$0
      ARGV[ARGC++] = "backup/" file
   }
   else {
      tgt[file,$0] = "F1/" file
   }
   next
}
(FILENAME,$6) in tgt {
   print > tgt[FILENAME,$6]
}
' data.txt &&
rm -rf backup

If you want the empty files too it's a trivial tweak and if you want to keep the backup dir just get rid of the "&& rm.." at the end (do that during testing anyway).

EDIT: FYI this is one case where you could argue the case for getline not being completely incorrect since it's parsing a first file that's totally unlike the rest of the files in structure and intent so parsing that one file differently from the rest isn't going to cause any maintenance headaches later:

mv F1 backup &&
mkdir F1 &&
awk -v data="data.txt" '
BEGIN {
   while ( (getline line < data) > 0 ) {
      if (sub(/>/,"",line)) {
         file=line
         ARGV[ARGC++] = "backup/" file
      }
      else {
         tgt[file,line] = "F1/" file
      }
   }
}
(FILENAME,$6) in tgt {
   print > tgt[FILENAME,$6]
}
' &&
rm -rf backup

but as you can see it makes the script a bit more complicated (though slightly more efficient as there's now no test for FNR==NR in the main body).

Upvotes: 1

Steve
Steve

Reputation: 54592

Here's one way using GNU awk. Run like:

awk -f script.awk data.txt

Contents of script.awk:

/^>/ {
    file = substr($1,2)
    next
}

{
    a[file][$1]
}

END {

    for (i in a) {

        while ( ( getline line < ("./F1/" i) ) > 0 ) {

            split(line,b)

            for (j in a[i]) {

                if (b[6]==j) {

                    print line > "./F1/" i ".new"
                }
            }
        }

        system(sprintf("mv ./F1/%s.new ./F1/%s", i, i))
    }
}

Alternatively, here's the one-liner:

awk '/^>/ { file = substr($1,2); next } { a[file][$1] } END { for (i in a) { while ( ( getline line < ("./F1/" i) ) > 0 ) { split(line,b); for (j in a[i]) if (b[6]==j) print line > "./F1/" i ".new" } system(sprintf("mv ./F1/%s.new ./F1/%s", i, i)) } }' data.txt


If you have an older version of awk, older than GNU Awk 4.0.0, you could try the following. Run like:

awk -f script.awk data.txt

Contents of script.awk:

/^>/ {
    file = substr($1,2)
    next
}

{
    a[file]=( a[file] ? a[file] SUBSEP : "") $1
}

END {

    for (i in a) {

        split(a[i],b,SUBSEP)

        while ( ( getline line < ("./F1/" i) ) > 0 ) {

            split(line,c)

            for (j in b) {

                if (c[6]==b[j]) {

                    print line > "./F1/" i ".new"
                }
            }
        }

        system(sprintf("mv ./F1/%s.new ./F1/%s", i, i))
    }
}

Alternatively, here's the one-liner:

awk '/^>/ { file = substr($1,2); next } { a[file]=( a[file] ? a[file] SUBSEP : "") $1 } END { for (i in a) { split(a[i],b,SUBSEP); while ( ( getline line < ("./F1/" i) ) > 0 ) { split(line,c); for (j in b) if (c[6]==b[j]) print line > "./F1/" i ".new" } system(sprintf("mv ./F1/%s.new ./F1/%s", i, i)) } }' data.txt


Please note that this script does exactly as you describe. It expects files like 1BN5.txt and 1B24.txt to reside in the folder F1 in the present working directory. It will also overwrite your original files. If this is not the desired behavior, drop the system() call. HTH.

Results:

Contents of F1/1BN5.txt:

ATOM    421  CA  SER A 207      68.627 -29.819   8.533  1.00 50.79           C 
ATOM    421  CA  SER A 207      68.627 -29.819   8.533  1.00 50.79           C  
ATOM    615  H   LEU B 208       3.361  -5.394  -6.021  1.00 10.00           H
ATOM    616  HA  LEU B 211       2.930  -4.494  -3.302  1.00 10.00           H 

Contents of F1/1B24.txt:

ATOM    631  CG  MET B  88      -0.828  -0.688  -7.575  1.00 10.00           C
ATOM    632  SD  MET B  88      -2.380  -0.156  -6.830  1.00 10.00           S
ATOM    643  N   ALA B  92      -1.541  -4.371  -5.366  1.00 10.00           N

Upvotes: 5

Joseph Quinsey
Joseph Quinsey

Reputation: 9972

I don't think you can do this with just sed alone. You need a loop to read your file data.txt. For example, using a bash script:

#!/bin/bash

# First remove all possible "problematic" characters from data.txt, storing result
# in data.clean.txt. This removes everything except A-Z, a-z, 0-9, leading >, and ..
sed 's/[^A-Za-z0-9>\.]//g;s/\(.\)>/\1/g;/^$/d' data.txt >| data.clean.txt

# Next determine which lines to keep:
cat data.clean.txt | while read line; do
   if [[ "${line:0:1}" == ">" ]]; then
      # If input starts with ">", set remainder to be the current file
      file="${line:1}"
   else
      # If value is in sixth column, add "keep" to end of line
      # Columns assumed separated by one or more spaces
      # "+" is a GNU extension, so we need the -r switch
      sed -i -r "/^[^ ]+ +[^ ]+ +[^ ]+ +[^ ]+ +$line +/s/$/keep/" $file
   fi
done

# Finally delete the unwanted lines, i.e. those without "keep":
# (assumes each file appears only once in data.txt)
cat data.clean.txt | while read line; do
   if [[ "${line:0:1}" == ">" ]]; then
      sed -i -n "/keep/{s/keep//g;p;}" ${line:1}
   fi
done

Upvotes: 0

nullrevolution
nullrevolution

Reputation: 4137

assuming gnu awk, run this command from the directory containing data.txt:

awk -F">" '{if($2 != ""){fname=$2}if($2 == ""){term=$1;system("grep "term" F1/"fname" >>F1/"fname"_results");}}' data.txt

this parses data.txt for filenames and search terms, then calls grep from inside awk to append the matches from each file and term listed in data.txt to a new file in F1 called originalfilename.txt_results.

if you want to replace the original files completely, you could then run this command:

grep "^>.*$" data.txt | sed 's/>//' | xargs -I{} find F1 -name {}_results -exec mv F1/{}_results F1/{} \;

Upvotes: 1

Chris Seymour
Chris Seymour

Reputation: 85913

Definitely a job for awk:

$ awk '$6==207||$6==208||$6==211 { print }' 1bn5.txt
ATOM    421  CA  SER A 207      68.627 -29.819   8.533  1.00 50.79           C 
ATOM    421  CA  SER A 207      68.627 -29.819   8.533  1.00 50.79           C  
ATOM    615  H   LEU B 208       3.361  -5.394  -6.021  1.00 10.00           H
ATOM    616  HA  LEU B 211       2.930  -4.494  -3.302  1.00 10.00           H 

$ awk '$6==92||$6==88 { print }' 1B24.txt
ATOM    631  CG  MET B  88      -0.828  -0.688  -7.575  1.00 10.00           C
ATOM    632  SD  MET B  88      -2.380  -0.156  -6.830  1.00 10.00           S
ATOM    643  N   ALA B  92      -1.541  -4.371  -5.366  1.00 10.00           N 

Redirect to save the output:

$ awk '$6==207||$6==208||$6==211 { print }' 1bn5.txt > output.txt

Upvotes: 0

Sjoerd
Sjoerd

Reputation: 75689

Don't try to delete lines from the existing file, try to create a new file with only the lines you want to have:

cat 1bn5.txt | awk '$6 == 207 || $6 == 208 || $6 == 211 { print }' > output.txt

Upvotes: 1

Related Questions