Reputation: 187
I am trying to find rows that share duplicate values in specific fields (columns 1 and 4) of a tab-delimited file, and to extract specific columns from the first and last row of each block of duplicate rows, but only when the values are above 0.
If the same pair of values (columns 1 and 4) appears at different locations, interspersed with other rows, the occurrences should be treated as separate blocks.
Sample input:
tmp1 153446387 153446446 -0.2 1.0888042
tmp1 153446925 153446973 0 0.87891006
tmp1 153451902 153451951 1.43854 1.2709045
tmp1 153454056 153454105 1.43854 1.4132746
tmp1 153456192 153456250 1.43854 0.87553155
tmp1 153458717 153458776 1.335858 1.1829022
tmp1 153460782 153460841 1.335858 0.006651476
tmp1 153462035 153462094 0 0.13484457
tmp1 153463690 153463749 1.43854 0.45511296
tmp1 153467589 153467673 1.43854 1.4431274
tmp1 153467873 153468632 0.31841 1.70443
tmp1 154451904 154451951 1.43854 1.3709045
tmp1 154454054 154454109 1.43854 1.132746
tmp1 154456194 154456259 1.43854 0.8553
tmp2 153472147 153472194 1.43854 0.99288875
tmp2 153476511 153476559 0 0.99288875
Output:
tmp1 153451902 153456250 1.43854
tmp1 153458717 153460841 1.335858
tmp1 153463690 153467673 1.43854
tmp1 154451904 154456259 1.43854
tmp2 153472147 153472194 1.43854
Any ideas on how to go about this?
Upvotes: 1
Views: 244
Reputation: 247220
awk '
BEGIN {OFS = FS = "\t"}

# Print one collapsed line for the block that just ended:
# column 1, first start, last end, column 4.
# "ary" is a local scratch array (extra function parameters are locals in awk).
function output(key, ary) {
    split(key, ary, FS)
    print ary[1], start, end, ary[2]
}

# Skip rows whose 4th column is not above 0.
$4 <= 0 {next}

# A new block starts whenever the (column 1, column 4) pair changes.
key != $1 FS $4 {
    if (end) output(key)    # flush the previous block, if any
    key = $1 FS $4
    start = $2
}

# Remember the end coordinate of the latest row in the current block.
{end = $3}

# Flush the final block.
END {output(key)}
' filename
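For readers more comfortable in Python, here is a minimal sketch of the same single-pass grouping. This is my own translation of the awk logic above, not part of the original answer, and the function name `collapse` is made up for illustration:

```python
def collapse(lines):
    """Collapse consecutive rows sharing (column 1, column 4) into one row:
    column 1, start of the first row, end of the last row, column 4.
    Rows whose 4th column is <= 0 are skipped entirely, mirroring
    the awk rule `$4 <= 0 {next}`."""
    out = []
    key = start = end = None
    for line in lines:
        # Only the first four tab-separated fields matter; any trailing
        # newline stays attached to the ignored fifth field.
        name, s, e, val = line.split("\t")[:4]
        if float(val) <= 0:
            continue
        if (name, val) != key:          # a new block begins here
            if key is not None:
                out.append((key[0], start, end, key[1]))  # flush previous block
            key, start = (name, val), s
        end = e                         # track the last row of the current block
    if key is not None:                 # flush the final block
        out.append((key[0], start, end, key[1]))
    return out
```

To mirror the awk command's output, feed it the file and print tab-joined tuples: `for r in collapse(open("filename")): print("\t".join(r))`.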
Upvotes: 2