MPR6

Reputation: 27

AWK - Counting and printing different patterns

I have a file with 13 columns. It looks just like this:

M01562:52:000000000-A9Y4G:1:1101:10000:13082_1:N:0:1    gene_id_8535    100.00  254 0   0   1   254 302 49  3.2e-140    495.0   254
M01562:52:000000000-A9Y4G:1:1101:10000:18672_1:N:0:1    gene_id_118536  100.00  193 0   0   1   193 54  246 1.6e-103    373.0   193
M01562:52:000000000-A9Y4G:1:1101:10000:18672_2:N:0:1    gene_id_118536  98.83   257 3   0   1   257 427 171 3.4e-137    485.0   257
M01562:52:000000000-A9Y4G:1:1101:10000:21866_2:N:0:1    gene_id_120720  100.00  195 0   0   1   195 448 254 4.9e-104    375.0   200
M01562:52:000000000-A9Y4G:1:1101:10000:5922_1:N:0:1     gene_id_17051   100.00  149 0   0   1   149 1849    1701    3.4e-78 289.0   149
M01562:52:000000000-A9Y4G:1:1101:10000:5922_2:N:0:1     gene_id_17051   100.00  123 0   0   1   123 1522    1644    1.3e-62 237.0   123
M01562:52:000000000-A9Y4G:1:1101:10000:6256_1:N:0:1     gene_id_121202  98.73   157 2   0   1   157 179 23  1.9e-81 300.0   157
M01562:52:000000000-A9Y4G:1:1101:10001:11433_1:N:0:1    gene_id_125209  99.07   108 1   0   1   108 118 11  1.8e-53 207.0   108
M01562:52:000000000-A9Y4G:1:1101:10001:11433_2:N:0:1    gene_id_125209  99.15   118 1   0   4   121 1   118 2.9e-59 226.0   121
M01562:52:000000000-A9Y4G:1:1101:10001:17591_1:N:0:1    gene_id_2387    100.00  152 0   0   1   152 1378    1529    2.2e-80 296.0   152
M01562:52:000000000-A9Y4G:1:1101:10001:17591_2:N:0:1    gene_id_2387    100.00  152 0   0   1   152 1529    1378    2.2e-80 296.0   152
M01562:52:000000000-A9Y4G:1:1101:10001:17844_1:N:0:1    gene_id_9456    100.00  100 0   0   1   100 176 275 8.5e-50 194.0   100

Now I need to count how many times each gene ID in the second column appears, and write each gene ID together with its count to a separate file, like this:

gene_id_9456           2
gene_id_125209         5
gene_id_2387           2

The gene IDs have different numbers of characters and differ from one another, so nothing I have tried so far works.

Also, could anyone recommend some really good websites to learn about awk? I have been reading http://www.grymoire.com/Unix/Awk.html but would like to have more sources.

Upvotes: 1

Views: 102

Answers (2)

Ed Morton

Reputation: 204218

Chances are all you need is something like this:

awk '{cnt[$2]++} END{for (gene in cnt) print gene, cnt[gene]}' file

but without sample input and expected output it's just a guess.
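As a hedged sketch of how the one-liner above could produce the requested output file: the file names sample.txt and gene_counts.txt and the three shortened sample rows are assumptions made for illustration. Because the order of a `for (gene in cnt)` loop is unspecified in awk, the result is piped through sort to get a stable order, and printf pads the gene ID so the counts line up even when the IDs differ in length.

```shell
# Hypothetical sample: three shortened rows, gene ID in column 2.
cat > sample.txt <<'EOF'
read1 gene_id_8535 100.00
read2 gene_id_118536 100.00
read3 gene_id_118536 98.83
EOF

# Count each value of column 2, pad the ID to 20 characters,
# sort for a stable order, and write the result to a separate file.
awk '{cnt[$2]++} END{for (gene in cnt) printf "%-20s %d\n", gene, cnt[gene]}' sample.txt |
    sort > gene_counts.txt

cat gene_counts.txt
```

With this sample, gene_id_118536 should be reported with a count of 2 and gene_id_8535 with a count of 1.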

Upvotes: 2

Axel Kemper

Reputation: 11322

The following awk script, countGenes.awk, would do the job. It writes each count to its own file, count_<gene>.txt, and also prints it to standard output:

#
#  countGenes.awk
#

BEGIN {
    columns = 13
    geneColumn = 2
}

#  select only lines with the expected number of fields    
(NF == columns) {
    geneCounts[$geneColumn]++
}

END {
    #  loop through the associative table of counts
    for (gene in geneCounts) {
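        #  write count to a per-gene file
        fileName = "count_" gene ".txt"
        printf "%s\t%d\n", gene, geneCounts[gene] >fileName
        #  close the file so that a long gene list does not exhaust open file handles
        close(fileName)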

        #  for logging
        printf "%s\t%d\n", gene, geneCounts[gene]
    }
}

Run the script with the following command:

awk -f countGenes.awk testDataGenes.txt
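If a single output file is preferred over one file per gene, a variant along these lines might work. This is only a sketch: the script name countGenesSingle.awk, the variable outFile, and the output name geneCounts.txt are hypothetical, and the test data is two rows from the question plus one deliberately malformed line to show the field-count check.

```shell
# Hypothetical variant: all counts go to one file named by the outFile variable.
cat > countGenesSingle.awk <<'EOF'
BEGIN {
    columns = 13
    geneColumn = 2
}

#  count only lines with the expected number of fields
NF == columns {
    geneCounts[$geneColumn]++
}

END {
    for (gene in geneCounts)
        printf "%s\t%d\n", gene, geneCounts[gene] > outFile
    close(outFile)
}
EOF

# Two rows from the question plus one malformed line that is skipped.
cat > testDataGenes.txt <<'EOF'
M01562:52:000000000-A9Y4G:1:1101:10001:17591_1:N:0:1 gene_id_2387 100.00 152 0 0 1 152 1378 1529 2.2e-80 296.0 152
M01562:52:000000000-A9Y4G:1:1101:10001:17591_2:N:0:1 gene_id_2387 100.00 152 0 0 1 152 1529 1378 2.2e-80 296.0 152
short line
EOF

awk -v outFile=geneCounts.txt -f countGenesSingle.awk testDataGenes.txt
cat geneCounts.txt
```

Passing the file name with -v keeps the script itself free of hard-coded paths, so the same script can be reused for different input files.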

Upvotes: 1
