Reputation: 207
I have a file data.txt like this.
>1BN5.txt
207
208
211
>1B24.txt
88
92
I have a folder F1 that contains text files.
1BN5.txt file in F1 folder is shown below.
ATOM 421 CA SER A 207 68.627 -29.819 8.533 1.00 50.79 C
ATOM 421 CA SER A 207 68.627 -29.819 8.533 1.00 50.79 C
ATOM 422 C SER A 248 70.124 -29.955 8.226 1.00 55.81 C
ATOM 615 H LEU B 208 3.361 -5.394 -6.021 1.00 10.00 H
ATOM 616 HA LEU B 211 2.930 -4.494 -3.302 1.00 10.00 H
ATOM 626 N MET B 87 1.054 -3.071 -5.633 1.00 10.00 N
ATOM 627 CA MET B 87 -0.213 -2.354 -5.826 1.00 10.00 C
1B24.txt file in F1 folder is shown below.
ATOM 630 CB MET B 87 -0.476 -2.140 -7.318 1.00 10.00 C
ATOM 631 CG MET B 88 -0.828 -0.688 -7.575 1.00 10.00 C
ATOM 632 SD MET B 88 -2.380 -0.156 -6.830 1.00 10.00 S
ATOM 643 N ALA B 92 -1.541 -4.371 -5.366 1.00 10.00 N
ATOM 644 CA ALA B 94 -2.560 -5.149 -4.675 1.00 10.00 C
I need only the lines containing 207,208,211(6th column)in 1BN5.txt file. I want to delete other lines in 1BN5.txt file. Like this, I need only the lines containing 88,92 in 1B24.txt file.
Desired output
1BN5.txt file
ATOM 421 CA SER A 207 68.627 -29.819 8.533 1.00 50.79 C
ATOM 421 CA SER A 207 68.627 -29.819 8.533 1.00 50.79 C
ATOM 615 H LEU B 208 3.361 -5.394 -6.021 1.00 10.00 H
ATOM 616 HA LEU B 211 2.930 -4.494 -3.302 1.00 10.00 H
1B24.txt file
ATOM 631 CG MET B 88 -0.828 -0.688 -7.575 1.00 10.00 C
ATOM 632 SD MET B 88 -2.380 -0.156 -6.830 1.00 10.00 S
ATOM 643 N ALA B 92 -1.541 -4.371 -5.366 1.00 10.00 N
Upvotes: 4
Views: 365
Reputation: 247230
This solution plays some tricks with the record separator: "data.txt" uses > as the record separator, while the other files use newline.
awk '
BEGIN {RS=">"}
FNR == 1 {
# since the first char in data.txt is the record separator,
# there is an empty record before the real data starts
next
}
{
n = split($0, a, "\n")
file = "F1/" a[1]
newfile = file ".new"
RS="\n"
while (getline < file) {
for (i=2; i<n; i++) {
if ($6 == a[i]) {
print > newfile
break
}
}
}
RS=">"
system(sprintf("mv \"%s\" \"%s.bak\" && mv \"%s\" \"%s\"", file, file, newfile, file))
}
' data.txt
Upvotes: 1
Reputation: 204719
This will move all of the files in F1 to a tmp dir named "backup" and then re-create just the resultant non-empty files under F1
mv F1 backup &&
mkdir F1 &&
awk '
NF==FNR {
if (sub(/>/,"")) {
file=$0
ARGV[ARGC++] = "backup/" file
}
else {
tgt[file,$0] = "F1/" file
}
next
}
(FILENAME,$6) in tgt {
print > tgt[FILENAME,$6]
}
' data.txt &&
rm -rf backup
If you want the empty files too it's a trivial tweak and if you want to keep the backup dir just get rid of the "&& rm.."
at the end (do that during testing anyway).
EDIT: FYI this is one case where you could argue the case for getline not being completely incorrect since it's parsing a first file that's totally unlike the rest of the files in structure and intent so parsing that one file differently from the rest isn't going to cause any maintenance headaches later:
mv F1 backup &&
mkdir F1 &&
awk -v data="data.txt" '
BEGIN {
while ( (getline line < data) > 0 ) {
if (sub(/>/,"",line)) {
file=line
ARGV[ARGC++] = "backup/" file
}
else {
tgt[file,line] = "F1/" file
}
}
}
(FILENAME,$6) in tgt {
print > tgt[FILENAME,$6]
}
' &&
rm -rf backup
but as you can see it makes the script a bit more complicated (though slightly more efficient as there's now no test for FNR==NR in the main body).
Upvotes: 1
Reputation: 54592
Here's one way using GNU awk
. Run like:
awk -f script.awk data.txt
Contents of script.awk
:
/^>/ {
file = substr($1,2)
next
}
{
a[file][$1]
}
END {
for (i in a) {
while ( ( getline line < ("./F1/" i) ) > 0 ) {
split(line,b)
for (j in a[i]) {
if (b[6]==j) {
print line > "./F1/" i ".new"
}
}
}
system(sprintf("mv ./F1/%s.new ./F1/%s", i, i))
}
}
Alternatively, here's the one-liner:
awk '/^>/ { file = substr($1,2); next } { a[file][$1] } END { for (i in a) { while ( ( getline line < ("./F1/" i) ) > 0 ) { split(line,b); for (j in a[i]) if (b[6]==j) print line > "./F1/" i ".new" } system(sprintf("mv ./F1/%s.new ./F1/%s", i, i)) } }' data.txt
If you have an older version of awk
, older than GNU Awk 4.0.0
, you could try the following. Run like:
awk -f script.awk data.txt
Contents of script.awk:
/^>/ {
file = substr($1,2)
next
}
{
a[file]=( a[file] ? a[file] SUBSEP : "") $1
}
END {
for (i in a) {
split(a[i],b,SUBSEP)
while ( ( getline line < ("./F1/" i) ) > 0 ) {
split(line,c)
for (j in b) {
if (c[6]==b[j]) {
print line > "./F1/" i ".new"
}
}
}
system(sprintf("mv ./F1/%s.new ./F1/%s", i, i))
}
}
Alternatively, here's the one-liner:
awk '/^>/ { file = substr($1,2); next } { a[file]=( a[file] ? a[file] SUBSEP : "") $1 } END { for (i in a) { split(a[i],b,SUBSEP); while ( ( getline line < ("./F1/" i) ) > 0 ) { split(line,c); for (j in b) if (c[6]==b[j]) print line > "./F1/" i ".new" } system(sprintf("mv ./F1/%s.new ./F1/%s", i, i)) } }' data.txt
Please note that this script does exactly as you describe. It expects files like 1BN5.txt
and 1B24.txt
to reside in the folder F1
in the present working directory. It will also overwrite your original files. If this is not the desired behavior, drop the system()
call. HTH.
Results:
Contents of F1/1BN5.txt
:
ATOM 421 CA SER A 207 68.627 -29.819 8.533 1.00 50.79 C
ATOM 421 CA SER A 207 68.627 -29.819 8.533 1.00 50.79 C
ATOM 615 H LEU B 208 3.361 -5.394 -6.021 1.00 10.00 H
ATOM 616 HA LEU B 211 2.930 -4.494 -3.302 1.00 10.00 H
Contents of F1/1B24.txt
:
ATOM 631 CG MET B 88 -0.828 -0.688 -7.575 1.00 10.00 C
ATOM 632 SD MET B 88 -2.380 -0.156 -6.830 1.00 10.00 S
ATOM 643 N ALA B 92 -1.541 -4.371 -5.366 1.00 10.00 N
Upvotes: 5
Reputation: 9972
I don't think you can do this with just sed
alone. You need a loop to read your file data.txt. For example, using a bash
script:
#!/bin/bash
# First remove all possible "problematic" characters from data.txt, storing result
# in data.clean.txt. This removes everything except A-Z, a-z, 0-9, leading >, and ..
sed 's/[^A-Za-z0-9>\.]//g;s/\(.\)>/\1/g;/^$/d' data.txt >| data.clean.txt
# Next determine which lines to keep:
cat data.clean.txt | while read line; do
if [[ "${line:0:1}" == ">" ]]; then
# If input starts with ">", set remainder to be the current file
file="${line:1}"
else
# If value is in sixth column, add "keep" to end of line
# Columns assumed separated by one or more spaces
# "+" is a GNU extension, so we need the -r switch
sed -i -r "/^[^ ]+ +[^ ]+ +[^ ]+ +[^ ]+ +$line +/s/$/keep/" $file
fi
done
# Finally delete the unwanted lines, i.e. those without "keep":
# (assumes each file appears only once in data.txt)
cat data.clean.txt | while read line; do
if [[ "${line:0:1}" == ">" ]]; then
sed -i -n "/keep/{s/keep//g;p;}" ${line:1}
fi
done
Upvotes: 0
Reputation: 4137
assuming gnu awk, run this command from the directory containing data.txt
:
awk -F">" '{if($2 != ""){fname=$2}if($2 == ""){term=$1;system("grep "term" F1/"fname" >>F1/"fname"_results");}}' data.txt
this parses data.txt
for filenames and search terms, then calls grep
from inside awk
to append the matches from each file and term listed in data.txt
to a new file in F1
called originalfilename.txt_results
.
if you want to replace the original files completely, you could then run this command:
grep "^>.*$" data.txt | sed 's/>//' | xargs -I{} find F1 -name {}_results -exec mv F1/{}_results F1/{} \;
Upvotes: 1
Reputation: 85913
Definitely a job for awk
:
$ awk '$6==207||$6==208||$6==211 { print }' 1bn5.txt
ATOM 421 CA SER A 207 68.627 -29.819 8.533 1.00 50.79 C
ATOM 421 CA SER A 207 68.627 -29.819 8.533 1.00 50.79 C
ATOM 615 H LEU B 208 3.361 -5.394 -6.021 1.00 10.00 H
ATOM 616 HA LEU B 211 2.930 -4.494 -3.302 1.00 10.00 H
$ awk '$6==92||$6==88 { print }' 1B24.txt
ATOM 631 CG MET B 88 -0.828 -0.688 -7.575 1.00 10.00 C
ATOM 632 SD MET B 88 -2.380 -0.156 -6.830 1.00 10.00 S
ATOM 643 N ALA B 92 -1.541 -4.371 -5.366 1.00 10.00 N
Redirect to save the output:
$ awk '$6==207||$6==208||$6==211 { print }' 1bn5.txt > output.txt
Upvotes: 0
Reputation: 75689
Don't try to delete lines from the existing file, try to create a new file with only the lines you want to have:
cat 1bn5.txt | awk '$6 == 207 || $6 == 208 || $6 == 211 { print }' > output.txt
Upvotes: 1