Steve Xu

Reputation: 509

Remove rows in bash based on a column's value

I have a file like this:

1 4014 1.97676  1   1
1 4014 1.97676  2   1
1 4014 1.97676  3   1
1 2014 1.97676  4   1
1 2014 1.97676  5   1
1 401 1.97676  6   1
1 401 1.97676  7   1
1 401 1.97676  8   1
1 14 1.97676  9   1
1 14 1.97676  10  1

I want to trim this file by removing every row whose value in the 2nd column is less than 1000. After trimming, the file should look like this:

1 4014 1.97676  1   1
1 4014 1.97676  2   1
1 4014 1.97676  3   1
1 2014 1.97676  4   1
1 2014 1.97676  5   1

How can I achieve this in bash? I don't want to do it in Python, especially with pandas, because it is slow on large files.

Another question: how can I write such bash commands in a .sh file (similar to a .py file run with python) and run the file in the terminal like this:

$ bash clean_file.sh inputfile.txt > outputfile.txt

Thank you very much.

Edit: here's what I actually want to do.

The file is like this:

NODE_1_length_4014_cov_1.97676  1   1
NODE_1_length_4014_cov_1.97676  2   1
NODE_1_length_4014_cov_1.97676  3   1
NODE_1_length_4014_cov_1.97676  4   1
NODE_1_length_4014_cov_1.97676  5   1
NODE_1_length_4014_cov_1.97676  6   1
NODE_1_length_4014_cov_1.97676  7   1
NODE_1_length_4014_cov_1.97676  8   1
NODE_1_length_4014_cov_1.97676  9   1
NODE_1_length_4014_cov_1.97676  10  1

I'd like to clean it using the following steps:

# First, split the first column on the delimiter '_' and keep only the numbers:
awk -F '_' -v OFS='\t' '{print $2,$4,$6,$7,$8}'
# Second, remove the two empty trailing columns that the first step produces:
cut -f 1-5
# Third, remove rows whose value in the 2nd column is less than 500:
awk '$2 >= 500 { print }'

I didn't include 'inputfile' and 'outputfile' in the commands above because each step uses the previous step's output as its input. I don't know how to combine the three steps into one script file saved on disk, so that I can run it from the terminal on files stored at different locations on my computer.
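
One way to combine them, for reference, is to chain the three commands with pipes in a single script; a minimal sketch (the $1 argument stands for the input file, matching the usage shown earlier):

#!/bin/bash
# clean_file.sh -- chain the three cleaning steps above with pipes
# usage: bash clean_file.sh inputfile.txt > outputfile.txt

awk -F '_' -v OFS='\t' '{print $2,$4,$6,$7,$8}' "$1" \
    | cut -f 1-5 \
    | awk '$2 >= 500 { print }'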

Thank you very much!

Upvotes: 2

Views: 3956

Answers (3)

karakfa

Reputation: 67467

Your second sample input file doesn't exercise the test condition (every row has length 4014), so I updated it with

$ sed -i '5,$s/4014/300/' file

and it became

NODE_1_length_4014_cov_1.97676  1   1
NODE_1_length_4014_cov_1.97676  2   1
NODE_1_length_4014_cov_1.97676  3   1
NODE_1_length_4014_cov_1.97676  4   1
NODE_1_length_300_cov_1.97676  5   1
NODE_1_length_300_cov_1.97676  6   1
NODE_1_length_300_cov_1.97676  7   1
NODE_1_length_300_cov_1.97676  8   1
NODE_1_length_300_cov_1.97676  9   1
NODE_1_length_300_cov_1.97676  10  1

You want to remove the entries with length less than 500. This simple awk script splits the first field on '_' and keeps only the lines whose fourth component (the length) is at least 500:

$ awk '{split($1,f1,"_")} f1[4]>=500' file

NODE_1_length_4014_cov_1.97676  1   1
NODE_1_length_4014_cov_1.97676  2   1
NODE_1_length_4014_cov_1.97676  3   1
NODE_1_length_4014_cov_1.97676  4   1

Upvotes: 1

randomir

Reputation: 18687

Such filtering is indeed trivial with awk, but just for completeness (education), here's a bash-only version:

#!/bin/bash

# "parse"/validate the script's argument (filename)
if [[ ! -e "$1" ]]; then
    echo "Usage: $0 FILE" >&2
    exit 1
fi

# iterate over lines, splitting each into fields on whitespace;
# fields[1] is the 2nd column, since bash arrays are 0-indexed
while read -ra fields; do
    (( fields[1] >= 1000 )) && echo "${fields[@]}"
done <"$1"

Usage looks like this:

$ ./clean_file.sh inputfile.txt > outputfile.txt
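
Note that a pure-bash read loop like this is typically much slower than awk, so for the large files mentioned in the question the awk approach is the better choice.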

Upvotes: 1

Ignacio Vazquez-Abrams

Reputation: 798516

bash is the wrong tool.

awk '$2 >= 1000 { print }'
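
To cover the second part of the question, this one-liner can be saved in a script file; a minimal sketch (the name clean_file.sh and the $1 argument are assumptions mirroring the usage shown in the question):

#!/bin/bash
# clean_file.sh -- print only rows whose 2nd column is >= 1000
awk '$2 >= 1000 { print }' "$1"

It can then be run exactly as asked: bash clean_file.sh inputfile.txt > outputfile.txt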

Upvotes: 2
