user14748664

AWK: filtering multi-column data

I am dealing with post-processing of CSV log files arranged in a multi-column format. Usually the first column corresponds to the line number (ID), the second contains its population (POP, the number of samples that fell into this ID), and the third column (dG) represents some inherent value of the ID (which is always negative):

ID, POP, dG
1, 7, -6.9700
2, 2, -6.9500
3, 2, -6.8500
4, 6, -6.7200
5, 14, -6.7100
6, 5, -6.7000
7, 10, -6.5600
8, 10, -6.4800
9, 7, -6.4500
10, 3, -6.4400
11, 8, -6.4300
12, 10, -6.4200
13, 3, -6.3300
14, 7, -6.2200
15, 1, -6.2000
16, 3, -6.2000
17, 4, -6.1700
18, 1, -6.0500
19, 9, -6.0200
20, 1, -6.0100
21, 1, -6.0000
22, 3, -5.9900
23, 4, -5.9800
24, 3, -5.9200
25, 2, -5.9100
26, 1, -5.8900
27, 1, -5.8500
28, 1, -5.8200
29, 1, -5.7900
30, 8, -5.7800
31, 1, -5.7800
32, 1, -5.7200
33, 3, -5.7100
34, 2, -5.7100
35, 1, -5.6900
36, 4, -5.6800
37, 2, -5.6500
38, 4, -5.6100
39, 1, -5.5900
40, 1, -5.5600
41, 1, -5.5500
42, 2, -5.5500
43, 1, -5.5200
44, 1, -5.5100
45, 2, -5.5000
46, 1, -5.5000
47, 3, -5.4700
48, 2, -5.4500
49, 1, -5.4500
50, 4, -5.4300
51, 1, -5.4300
52, 1, -5.3800
53, 2, -5.3800
54, 1, -5.3500
55, 1, -5.2800
56, 1, -5.2500
57, 2, -5.2500
58, 2, -5.2400
59, 2, -5.2300
60, 1, -5.1400
61, 1, -5.1100
62, 1, -5.1000
63, 2, -5.0300
64, 2, -5.0100
65, 2, -5.0100
66, 1, -4.9700
67, 1, -4.9200
68, 1, -4.9000
69, 2, -4.9000
70, 1, -4.8900
71, 1, -4.8600
72, 3, -4.7900
73, 2, -4.7900
74, 1, -4.7900
75, 1, -4.7700
76, 2, -4.7600
77, 1, -4.7500
78, 1, -4.7400
79, 1, -4.7300
80, 1, -4.7200
81, 2, -4.7100
82, 1, -4.6800
83, 2, -4.6300
84, 1, -4.5500
85, 1, -4.5000
86, 1, -4.4800
87, 2, -4.4500
88, 1, -4.4300
89, 1, -4.3900
90, 1, -4.3000
91, 1, -4.2500
92, 1, -4.2300
93, 1, -4.2200
94, 2, -4.1600
95, 1, -4.1500
96, 1, -4.1100
97, 1, -4.0300
98, 1, -4.0100

I need to reduce the total number of these lines, keeping in the output CSV only the first N lines, from the beginning down to the line with the biggest population (POP, the value in the second column) observed in the whole dataset. So in my example the expected output should be the first 5 lines, since the 5th ID has the biggest value in the second column (POP) of all 98 lines:

ID, POP, dG
1, 7, -6.9700
2, 2, -6.9500
3, 2, -6.8500
4, 6, -6.7200
5, 14, -6.7100

Could you suggest an AWK solution that would accept my CSV file and produce a new one after such filtering, based on the values in the second column?

Upvotes: 0

Views: 88

Answers (3)

karakfa

Reputation: 67567

$ awk -F, -v minlines=5 'NR==FNR { if($2>=max && NR>1) {max=$2; maxi=NR} next } 
                         FNR<=minlines+1 || FNR<=maxi' file{,}

ID, POP, dG
1, 7, -6.9700
2, 2, -6.9500
3, 2, -6.8500
4, 6, -6.7200
5, 14, -6.7100

This will print up to the last occurrence of the max value; if you want the first occurrence instead, change $2>=max to $2>max. (file{,} is shell brace expansion for "file file", since the data has to be read in two passes.)
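The first-occurrence variant can be checked on a small made-up sample (sample.csv is a hypothetical file name) where the max POP value 14 appears twice, at data lines 2 and 4; with $2>max the output stops at the earlier one:

```shell
# Made-up sample: POP 14 occurs at data lines 2 and 4.
printf 'ID, POP, dG\n1, 7, -6.97\n2, 14, -6.95\n3, 2, -6.85\n4, 14, -6.72\n' > sample.csv

# First-occurrence variant: $2>max instead of $2>=max.
awk -F, -v minlines=1 'NR==FNR { if($2>max && NR>1) {max=$2; maxi=NR} next }
                       FNR<=minlines+1 || FNR<=maxi' sample.csv sample.csv
```

This prints the header plus data lines 1 and 2 only, since maxi is set at the first line holding the max and never advanced by the later tie.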

Upvotes: 1

glenn jackman

Reputation: 247192

This approach processes the file twice: once to find the max, and again to print the lines up to it. I've incorporated your request to print a minimum number of lines.

awk -F ', ' -v min_lines=5 '
    NR == FNR {
        # skip the header so the string "POP" never becomes the max
        if (FNR > 1 && $2 > max) max = $2
        next
    }
    {print}
    $2 == max {
        for (i = FNR; i <= min_lines; i++) {
            getline
            print
        }
        exit
    }
' file.csv file.csv
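The min_lines padding can be seen on a made-up sample (sample2.csv is a hypothetical name) where the max POP sits at data line 2 but min_lines=4; the getline loop keeps reading past the max until four data lines have been printed. The first pass here skips the header so "POP" is never compared as a candidate max:

```shell
# Made-up sample: max POP (14) is at data line 2.
printf 'ID, POP, dG\n1, 7, -6.97\n2, 14, -6.95\n3, 2, -6.85\n4, 6, -6.72\n5, 3, -6.71\n' > sample2.csv

awk -F ', ' -v min_lines=4 '
    NR == FNR { if (FNR > 1 && $2 > max) max = $2; next }
    { print }
    $2 == max {
        # keep reading until min_lines data lines are out (FNR counts the header)
        for (i = FNR; i <= min_lines; i++) { getline; print }
        exit
    }
' sample2.csv sample2.csv
```

This emits the header plus data lines 1 through 4, even though the max was already reached at line 2.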

Upvotes: 1

F. Knorr

Reputation: 3065

You could try this awk command:

awk -F "," '
    a < $2 {
        for (idx = 0; idx < i; idx++) print arr[idx]   # flush the buffer
        print                                          # print the new-max line
        a = int($2); i = 0
        next
    }
    NR > 1 { arr[i] = $0; i++ }   # buffer lines until a bigger POP appears
' input

See demo at: https://awk.js.org/?gist=c8751cc25e444fb2e2b1a8f29849f127
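A quick sanity check on a made-up input.csv (hypothetical name); the command is restated here with a next after the flush and a plain NR > 1 buffering pattern, so a line whose POP merely equals the running max is buffered rather than silently dropped:

```shell
# Made-up sample: max POP (14) is at data line 4.
printf 'ID, POP, dG\n1, 7, -6.97\n2, 2, -6.95\n3, 6, -6.72\n4, 14, -6.71\n5, 3, -6.44\n' > input.csv

# Single pass: buffer lines while POP stays at or below the running max,
# flush the buffer only when a strictly bigger POP appears.
awk -F "," '
    a < $2 {
        for (idx = 0; idx < i; idx++) print arr[idx]
        print
        a = int($2); i = 0
        next
    }
    NR > 1 { arr[i] = $0; i++ }
' input.csv
```

Data lines 2 and 3 are held in arr[] and flushed when line 4 raises the max; line 5 stays buffered and is never printed, so the output ends at the global max.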

Upvotes: 2
