Reputation: 27
I have a file with 13 columns. It looks just like this:
M01562:52:000000000-A9Y4G:1:1101:10000:13082_1:N:0:1 gene_id_8535 100.00 254 0 0 1 254 302 49 3.2e-140 495.0 254
M01562:52:000000000-A9Y4G:1:1101:10000:18672_1:N:0:1 gene_id_118536 100.00 193 0 0 1 193 54 246 1.6e-103 373.0 193
M01562:52:000000000-A9Y4G:1:1101:10000:18672_2:N:0:1 gene_id_118536 98.83 257 3 0 1 257 427 171 3.4e-137 485.0 257
M01562:52:000000000-A9Y4G:1:1101:10000:21866_2:N:0:1 gene_id_120720 100.00 195 0 0 1 195 448 254 4.9e-104 375.0 200
M01562:52:000000000-A9Y4G:1:1101:10000:5922_1:N:0:1 gene_id_17051 100.00 149 0 0 1 149 1849 1701 3.4e-78 289.0 149
M01562:52:000000000-A9Y4G:1:1101:10000:5922_2:N:0:1 gene_id_17051 100.00 123 0 0 1 123 1522 1644 1.3e-62 237.0 123
M01562:52:000000000-A9Y4G:1:1101:10000:6256_1:N:0:1 gene_id_121202 98.73 157 2 0 1 157 179 23 1.9e-81 300.0 157
M01562:52:000000000-A9Y4G:1:1101:10001:11433_1:N:0:1 gene_id_125209 99.07 108 1 0 1 108 118 11 1.8e-53 207.0 108
M01562:52:000000000-A9Y4G:1:1101:10001:11433_2:N:0:1 gene_id_125209 99.15 118 1 0 4 121 1 118 2.9e-59 226.0 121
M01562:52:000000000-A9Y4G:1:1101:10001:17591_1:N:0:1 gene_id_2387 100.00 152 0 0 1 152 1378 1529 2.2e-80 296.0 152
M01562:52:000000000-A9Y4G:1:1101:10001:17591_2:N:0:1 gene_id_2387 100.00 152 0 0 1 152 1529 1378 2.2e-80 296.0 152
M01562:52:000000000-A9Y4G:1:1101:10001:17844_1:N:0:1 gene_id_9456 100.00 100 0 0 1 100 176 275 8.5e-50 194.0 100
Now I need to count how many times each gene ID in the second column appears, and write the result to a separate file that lists each gene ID with its number of occurrences. Something like this:
gene_id_9456 2
gene_id_125209 5
gene_id_2387 2
The gene IDs have different numbers of characters and differ from one another entirely, so nothing I have tried has worked...
Also, could anyone recommend some really good websites to learn about awk? I have been reading http://www.grymoire.com/Unix/Awk.html but would like to have more sources.
Upvotes: 1
Views: 102
Reputation: 204218
Chances are all you need is something like this:
awk '{cnt[$2]++} END{for (gene in cnt) print gene, cnt[gene]}' file
but without sample input and expected output it's just a guess.
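Since the question asks for the counts in a separate file, the same one-liner can simply be redirected. Here is a minimal sketch, assuming whitespace-separated input; the file names `sample.txt` and `gene_counts.txt` and the shortened read IDs are placeholders made up for illustration, not taken from the real data:

```shell
# Build a tiny stand-in input (hypothetical read IDs, only 4 columns).
cat > sample.txt <<'EOF'
read_1 gene_id_2387 100.00 152
read_2 gene_id_2387 100.00 152
read_3 gene_id_9456 100.00 100
EOF

# Count each value of column 2. awk's for-in order is unspecified,
# so pipe through sort to get a stable listing before writing the file.
awk '{cnt[$2]++} END{for (gene in cnt) print gene, cnt[gene]}' sample.txt | sort > gene_counts.txt

cat gene_counts.txt
```

Because the array is keyed on the whole field, it does not matter that the gene IDs have different lengths.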
Upvotes: 2
Reputation: 11322
The following awk script, countGenes.awk, would do the job:
#
# countGenes.awk
#
BEGIN {
    columns = 13
    geneColumn = 2
}
# select only lines with the expected number of fields
(NF == columns) {
    geneCounts[$geneColumn]++
}
END {
    # loop through the associative table of counts
    for (gene in geneCounts) {
        # write count to a per-gene file
        fileName = "count_" gene ".txt"
        printf "%s\t%d\n", gene, geneCounts[gene] >fileName
        # close each file so that many genes do not exhaust open file handles
        close(fileName)
        # for logging
        printf "%s\t%d\n", gene, geneCounts[gene]
    }
}
Run the script with the following command:
awk -f countGenes.awk testDataGenes.txt
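To see what the `NF == columns` guard does, here is a small sketch that passes the same two settings with `-v` and collects everything into one file instead of one file per gene. The three-column input, the file names `demo.txt` and `demo_counts.txt`, and the gene names are assumptions made up for the demonstration:

```shell
# Hypothetical input: the second row is short (2 fields) and should be skipped.
printf '%s\n' 'r1 gene_a x' 'r2 gene_a' 'r3 gene_b y' > demo.txt

# Same counting logic as the script, with columns=3 for this toy input.
awk -v columns=3 -v geneColumn=2 '
    NF == columns { n[$geneColumn]++ }
    END { for (g in n) printf "%s\t%d\n", g, n[g] }
' demo.txt | sort > demo_counts.txt

cat demo_counts.txt
```

Here `gene_a` is counted once rather than twice, because the malformed row is filtered out before counting.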
Upvotes: 1