5heikki

Reputation: 152

Print all rows that are in specified value range based on specific column as pairs in awk

The input data is sorted on column 2, as shown below:

1   100
1   101
1   200
3   360
4   800
4   950
4   952

With the example data, the desired output is:

1   200 3   360
4   800 4   950
4   800 4   952

That is, print every pair of lines whose column 2 values are in range: value2 is greater than value1+100 && value2 is less than value1+200, where value1 comes from the first line of the pair and value2 from the second.

My attempt was:

awk 'BEGIN{FS="\t"; PREVLOC=$2; PREVLINE=$0}{ if($2>PREVLOC+200 || $2<PREVLOC+100) {PREVLOC=$2; PREVLINE=$0;} else {print PREVLINE"\t"$0; PREVLOC=$2; PREVLINE=$0;} }' inputfile

This saves the previous line and its column 2 value in variables for comparison. However, it does not work in all cases, because it only ever compares adjacent lines. With the example data, it would not print the last pair. It also would not output the 800 - 950 pair if there were a line between them whose second column value was, e.g., 890.

Currently, I have solved the problem in a completely different way in bash with:

`while read var1 var2; do stuff with vars in awk; done<inputfile`

But it's very slow. Any help is much appreciated.

Upvotes: 1

Views: 199

Answers (1)

jas

Reputation: 10865

I don't know how much of an improvement this will be for you as it's still an O(n^2) algorithm, but it's all in awk and worth a try.

There are two passes. The NR==FNR block is the first pass and reads the entire file into memory (another possible issue if the file is extremely large, and I'm guessing it's pretty large if you're worried about performance). For each row we store the range to be tested against in the second pass.

The second pass goes line by line and, for each line, scans the full set of ranges to find those that match the condition.

Be sure to note that you need to provide the input file twice on the command line when invoking awk, as shown.

$ cat input.txt
1   100
1   101
1   200
3   360
4   800
4   950
4   952

$ cat b.awk
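# first pass, load array with ranges, keyed by the whole input line
# (identical duplicate lines therefore collapse into one entry)
NR==FNR {range[$0] = ($2 + 100) ":" ($2 + 200); next}

# Here we process the file for the second time, looping through
# all ranges for every line of input
# (awk does not guarantee the iteration order of "for (i in range)")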
{
    for (i in range) {
        split(range[i], r, ":")
        if ($2 > r[1] && $2 < r[2]) {
            print i, $0
        }
    }
}

$ awk -f b.awk input.txt input.txt
1   200 3   360
4   800 4   950
4   800 4   952
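
Because the input is already sorted on column 2, the quadratic scan can probably be avoided with a sliding window that keeps only the earlier lines still able to pair with the current one. What follows is a minimal sketch under that assumption (ascending sort on column 2, numeric values, whitespace-separated fields), offered as an alternative to try rather than a tested replacement for b.awk. The function name pairs_sorted and the variables val, line, head and n are made up for illustration.

```shell
# pairs_sorted: hypothetical helper; expects input sorted ascending on column 2.
# Reads the named file(s), or standard input when no file is given.
pairs_sorted() {
    awk '
    BEGIN { head = 1; n = 0 }
    {
        # Drop buffered lines that can no longer pair with this line or any
        # later one: their value + 200 is already <= the current value.
        while (head <= n && val[head] + 200 <= $2) {
            delete val[head]
            delete line[head]
            head++
        }
        # The buffer is in ascending order, so the matches form a prefix:
        # stop at the first buffered value that is not more than 100 below.
        for (i = head; i <= n && val[i] + 100 < $2; i++)
            print line[i], $0
        # Buffer the current line for comparison with later lines.
        n++
        val[n] = $2
        line[n] = $0
    }
    ' "$@"
}

# Example run on the sample data from the question.
printf '1   100\n1   101\n1   200\n3   360\n4   800\n4   950\n4   952\n' | pairs_sorted
# prints:
# 1   200 3   360
# 4   800 4   950
# 4   800 4   952
```

Each line is buffered once and dropped once, so the work should be roughly proportional to the number of lines plus the number of pairs printed, and only lines within 200 of the current value stay in memory. The input file is read once, so it does not need to be given twice. If the file is not sorted on column 2, this sketch will miss pairs.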

Upvotes: 1
