Istvan

Reputation: 8562

How to get cardinality of fields with AWK?

I am trying to count the number of unique values in each field of a txt file.

Sample:

2008,12,13,6,1007,847,1149,1010,DL,1631,N909DA,162,143,122,99,80,ATL,IAH,689,8,32,0,,0,1,0,19,0,79
2008,12,13,6,638,640,808,753,DL,1632,N604DL,90,73,50,15,-2,JAX,ATL,270,14,26,0,,0,0,0,15,0,0
2008,12,13,6,756,800,1032,1026,DL,1633,N642DL,96,86,56,6,-4,MSY,ATL,425,23,17,0,,0,NA,NA,NA,NA,NA
2008,12,13,6,612,615,923,907,DL,1635,N907DA,131,112,103,16,-3,GEG,SLC,546,5,23,0,,0,0,0,16,0,0
2008,12,13,6,749,750,901,859,DL,1636,N646DL,72,69,41,2,-1,SAV,ATL,215,20,11,0,,0,NA,NA,NA,NA,NA
2008,12,13,6,1002,959,1204,1150,DL,1636,N646DL,122,111,71,14,3,ATL,IAD,533,6,45,0,,0,NA,NA,NA,NA,NA
2008,12,13,6,834,835,1021,1023,DL,1637,N908DL,167,168,139,-2,-1,ATL,SAT,874,5,23,0,,0,NA,NA,NA,NA,NA
2008,12,13,6,655,700,856,856,DL,1638,N671DN,121,116,85,0,-5,PBI,ATL,545,24,12,0,,0,NA,NA,NA,NA,NA
2008,12,13,6,1251,1240,1446,1437,DL,1639,N646DL,115,117,89,9,11,IAD,ATL,533,13,13,0,,0,NA,NA,NA,NA,NA
2008,12,13,6,1110,1103,1413,1418,DL,1641,N908DL,123,135,104,-5,7,SAT,ATL,874,8,11,0,,0,NA,NA,NA,NA,NA

Full dataset here: https://github.com/markgrover/cloudcon-hive (Flight delay dataset from 2008.)

Taking one column at a time (a separate pass per column), we can do:

 for i in $(seq 1 28); do cut -d',' -f$i 2008.csv | head |sort | uniq | wc -l ; done |tr '\n' ':' ; echo

Is there a way to do it in one go for all the columns?

I think the expected output looks like this:

 1:1:1:1:10:10:10:10:1:10:9:9:6:9:9:9:2:5:5:5:6:1:1:1:3:2:2:2:

For the entire dataset:

1:12:31:7:1441:1217:1441:1378:20:7539:5374:690:526:664:1154:1135:303:304:1435:191:343:2:5:2:985:600:575:157:

Upvotes: 0

Views: 60

Answers (3)

karakfa

Reputation: 67467

another awk

this will give you rolling counts; pipe to tail -1 to get the last line with the overall counts (see the example after the output)

$ awk -F, -v OFS=: '{for(i=1;i<=NF;i++) 
                       printf "%s%s", NR-(a[i,$i]++?++c[i]:c[i]),(i==NF)?ORS:OFS}' file

1:1:1:1:1:1:1:1:1:1:1:1:1:1:1:1:1:1:1:1:1:1:1:1:1:1:1:1:1
1:1:1:1:2:2:2:2:1:2:2:2:2:2:2:2:2:2:2:2:2:1:1:1:2:1:2:1:2
1:1:1:1:3:3:3:3:1:3:3:3:3:3:3:3:3:2:3:3:3:1:1:1:3:2:3:2:3
1:1:1:1:4:4:4:4:1:4:4:4:4:4:4:4:4:3:4:4:4:1:1:1:3:2:4:2:3
1:1:1:1:5:5:5:5:1:5:5:5:5:5:5:5:5:3:5:5:5:1:1:1:3:2:4:2:3
1:1:1:1:6:6:6:6:1:5:5:6:6:6:6:6:5:4:6:6:6:1:1:1:3:2:4:2:3
1:1:1:1:7:7:7:7:1:6:6:7:7:7:7:6:5:5:7:6:6:1:1:1:3:2:4:2:3
1:1:1:1:8:8:8:8:1:7:7:8:8:8:8:7:6:5:8:7:7:1:1:1:3:2:4:2:3
1:1:1:1:9:9:9:9:1:8:7:9:9:9:9:8:7:5:8:8:8:1:1:1:3:2:4:2:3
1:1:1:1:10:10:10:10:1:9:7:10:10:10:10:9:8:5:8:8:8:1:1:1:3:2:4:2:3
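
for example, piping the same command to tail -1 keeps only the last line, i.e. the overall counts for the sample input:

$ awk -F, -v OFS=: '{for(i=1;i<=NF;i++) 
                       printf "%s%s", NR-(a[i,$i]++?++c[i]:c[i]),(i==NF)?ORS:OFS}' file | tail -1
1:1:1:1:10:10:10:10:1:9:7:10:10:10:10:9:8:5:8:8:8:1:1:1:3:2:4:2:3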

Upvotes: 1

Ed Morton

Reputation: 203358

With GNU awk for true multi-dimensional arrays:

$ cat tst.awk
BEGIN { FS=","; OFS=":" }
{
    for (i=1; i<=NF; i++) {
        vals[i][$i]
    }
}
END {
    for (i=1; i<=NF; i++) {
        printf "%s%s", length(vals[i]), (i<NF?OFS:ORS)
    }
}

$ awk -f tst.awk file
1:1:1:1:10:10:10:10:1:9:7:10:10:10:10:9:8:5:8:8:8:1:1:1:3:2:4:2:3

and with any awk:

$ cat tst.awk
BEGIN { FS=","; OFS=":" }
{
    for (i=1; i<=NF; i++) {
        if ( !seen[i,$i]++ ) {
            cnt[i]++
        }
    }
}
END {
    for (i=1; i<=NF; i++) {
        printf "%s%s", cnt[i], (i<NF?OFS:ORS)
    }
}

$ awk -f tst.awk file
1:1:1:1:10:10:10:10:1:9:7:10:10:10:10:9:8:5:8:8:8:1:1:1:3:2:4:2:3
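
The same portable logic can also be run inline without a separate script file; a minimal one-liner sketch of the script above (same arrays, FS/OFS set on the command line) that should print the same summary line:

$ awk -F, -v OFS=: '{for(i=1;i<=NF;i++) if(!seen[i,$i]++) cnt[i]++} END{for(i=1;i<=NF;i++) printf "%s%s", cnt[i], (i<NF?OFS:ORS)}' file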

Upvotes: 3

James Brown

Reputation: 37404

In GNU awk:

$ awk '
BEGIN { FS=OFS="," }                        # delimiters to , 
{
    for(i=1;i<=NF;i++)                      # iterate over every field
        a[i][$i]                            # store unique values to 2d hash
}
END {                                       # after all the records
    for(i=1;i<=NF;i++)                      # iterate the unique values for each field
        for(j in a[i])
            c[i]++                          # count them and 
    for(i=1;i<=NF;i++) 
        printf "%s%s",c[i], (i==NF?ORS:OFS) # output the values
}' file
1,1,1,1,10,10,10,10,1,9,7,10,10,10,10,9,8,5,8,8,8,1,1,1,3,2,4,2,3

The output is not exactly the same as yours; I'm not sure whether the mistake is yours or mine. Well, the last column has the values 79, 0 and NA, so mine looks more accurate for that one.
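
If you want the colon-separated output shown in the question, a small variant (just a sketch; the rest of the script stays the same) is to set the output separator independently of the input separator in the BEGIN block:

BEGIN { FS=","; OFS=":" }                   # read comma-separated input, print colon-separated counts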

Upvotes: 2
