Reputation: 509
I have a file like this:
1 4014 1.97676 1 1
1 4014 1.97676 2 1
1 4014 1.97676 3 1
1 2014 1.97676 4 1
1 2014 1.97676 5 1
1 401 1.97676 6 1
1 401 1.97676 7 1
1 401 1.97676 8 1
1 14 1.97676 9 1
1 14 1.97676 10 1
I want to trim this file: remove rows whose value in the 2nd column is less than 1000. After trimming, the file should look like this:
1 4014 1.97676 1 1
1 4014 1.97676 2 1
1 4014 1.97676 3 1
1 2014 1.97676 4 1
1 2014 1.97676 5 1
How can I achieve this in bash? I don't want to do it in Python, especially with pandas, because it is slow when dealing with large files.
Another question: how can I write such bash commands in a .sh file (similar to a .py file run by Python) and run it in the terminal like this:
$ bash clean_file.sh inputfile.txt > outputfile.txt
Thank you very much.
Here's what I actually want to do. The real file looks like this:
NODE_1_length_4014_cov_1.97676 1 1
NODE_1_length_4014_cov_1.97676 2 1
NODE_1_length_4014_cov_1.97676 3 1
NODE_1_length_4014_cov_1.97676 4 1
NODE_1_length_4014_cov_1.97676 5 1
NODE_1_length_4014_cov_1.97676 6 1
NODE_1_length_4014_cov_1.97676 7 1
NODE_1_length_4014_cov_1.97676 8 1
NODE_1_length_4014_cov_1.97676 9 1
NODE_1_length_4014_cov_1.97676 10 1
I'd like to clean it using the following steps:
#First, split the first column by the delimiter '_' and only keep the numbers:
awk -F '_' -v OFS='\t' '{print $2,$4,$6,$7,$8}'
#Second, remove the last two empty columns, because the first step leaves two extra empty columns behind:
cut -f 1-5
#Third, remove rows whose value in the 2nd column is less than 500:
awk '$2 >= 500 { print }'
I didn't add 'inputfile' and 'outputfile' to the commands above, because each step uses the previous step's output as its input. I don't know how to combine the three steps into one script file, save it to disk, and run it in the terminal on files stored at different locations on my computer.
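Here's roughly how I imagine chaining my three steps in one script file, with the input file passed as the first argument ("$1"), though I haven't tested it:
#!/bin/bash
# clean_file.sh -- rough sketch chaining the three steps above; reads the input file from $1
awk -F '_' -v OFS='\t' '{print $2,$4,$6,$7,$8}' "$1" |
  cut -f 1-5 |
  awk '$2 >= 500 { print }'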
Thank you very much!
Upvotes: 2
Views: 3956
Reputation: 67467
Your second sample input file doesn't exercise the test condition (every row would pass), so I updated it with
$ sed -i '5,$s/4014/300/' file
and it became
NODE_1_length_4014_cov_1.97676 1 1
NODE_1_length_4014_cov_1.97676 2 1
NODE_1_length_4014_cov_1.97676 3 1
NODE_1_length_4014_cov_1.97676 4 1
NODE_1_length_300_cov_1.97676 5 1
NODE_1_length_300_cov_1.97676 6 1
NODE_1_length_300_cov_1.97676 7 1
NODE_1_length_300_cov_1.97676 8 1
NODE_1_length_300_cov_1.97676 9 1
NODE_1_length_300_cov_1.97676 10 1
You want to remove the entries with length less than 500. This simple awk script will do it:
$ awk '{split($1,f1,"_")} f1[4]>=500' file
NODE_1_length_4014_cov_1.97676 1 1
NODE_1_length_4014_cov_1.97676 2 1
NODE_1_length_4014_cov_1.97676 3 1
NODE_1_length_4014_cov_1.97676 4 1
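If you also want the output in the split-out numeric form from the first example in the question, the same split can feed the print directly. A sketch (tab-separated output; assumes the same field layout as above):
$ awk -v OFS='\t' '{split($1,f1,"_")} f1[4]>=500 {print f1[2],f1[4],f1[6],$2,$3}' file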
Upvotes: 1
Reputation: 18687
Such filtering is indeed trivial with awk, but just for completeness (education), here's a bash-only version:
#!/bin/bash
# "parse"/validate a script's argument (filename)
if [[ ! -e "$1" ]]; then
  echo "Usage: $0 FILE" >&2
  exit 1
fi
# iterate over lines, splitting into fields on whitespaces
while read -ra fields; do
  (( fields[1] >= 1000 )) && echo "${fields[@]}"
done <"$1"
The usage is like:
$ ./clean_file.sh inputfile.txt > outputfile.txt
Upvotes: 1
Reputation: 798516
bash is the wrong tool.
awk '$2 >= 1000 { print }'
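For the second part of the question (putting this in a .sh file), a minimal sketch could be:
#!/bin/bash
# clean_file.sh -- keep only rows whose 2nd column is >= 1000
awk '$2 >= 1000 { print }' "$1"
which can then be run exactly as in the question:
$ bash clean_file.sh inputfile.txt > outputfile.txt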
Upvotes: 2