theo4786

Reputation: 159

awk for loop to break up file into chunks

I have a large file that I would like to break into chunks by field 2. Field 2 ranges in value from about 0 to about 250 million.

1 10492 rs55998931 C T 6 7 3 3 - 0.272727272727273 0.4375
1 13418 . G A 6 1 2 3 DDX11L1 0.25 0.0625
1 13752 . T C 4 4 1 3 DDX11L1 0.153846153846154 0.25
1 13813 . T G 1 4 0 1 DDX11L1 0.0357142857142857 0.2
1 13838 rs200683566 C T 1 4 0 1 DDX11L1 0.0357142857142857 0.2

I want field 2 to be broken up into intervals of 50,000, but overlapping by 2,000. For example, the first three awk commands would look like:

awk '$1=="1" && $2>=0 && $2<=50000{print$0}' Highalt.Lowalt.allelecounts.filteredformissing.freq > chr1.0kb.50kb

awk '$1=="1" && $2>=48000 && $2<=98000{print$0}' Highalt.Lowalt.allelecounts.filteredformissing.freq > chr1.48kb.98kb

awk '$1=="1" && $2>=96000 && $2<=146000{print$0}' Highalt.Lowalt.allelecounts.filteredformissing.freq > chr1.96kb.146kb

I know that there's a way I can do this using a for loop with variables like i and j. Can someone help me out?

Upvotes: 1

Views: 88

Answers (1)

John1024

Reputation: 113824

awk '$1=="1"{n=int($2/48000); print>("chr1." (48*n) "kb." (48*n+50) "kb");n--; if (n>=0 && $2/1000<=48*n+50) print>("chr1." (48*n) "kb." (48*n+50) "kb");}' Highalt.Lowalt.allelecounts.filteredformissing.freq

Or spread out over multiple lines:

awk '$1=="1"{
    n=int($2/48000)
    print>("chr1." (48*n) "kb." (48*n+50) "kb")
    n--
    if (n>=0 && $2/1000<=48*n+50)
        print>("chr1." (48*n) "kb." (48*n+50) "kb")
}' Highalt.Lowalt.allelecounts.filteredformissing.freq
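To check the overlap behaviour before running this on the full file, here is a small sketch that applies the same command to a made-up input. The file name sample.freq and the six rows (with a one-letter label in place of the real columns) are assumptions for illustration only.

```shell
# Hypothetical sample rows: chromosome, position, label (real rows have more columns).
printf '%s\n' '1 10492 a' '1 48500 b' '1 50000 c' '1 50001 d' '1 96000 e' '2 100 f' > sample.freq

# Same program as above, run on the sample instead of the real file.
awk '$1=="1"{
    n=int($2/48000)
    print>("chr1." (48*n) "kb." (48*n+50) "kb")
    n--
    if (n>=0 && $2/1000<=48*n+50)
        print>("chr1." (48*n) "kb." (48*n+50) "kb")
}' sample.freq

# List each chunk file followed by its contents.
for f in chr1.0kb.50kb chr1.48kb.98kb chr1.96kb.146kb; do
    echo "== $f"
    cat "$f"
done
```

Rows b and c (positions 48500 and 50000) appear in both chr1.0kb.50kb and chr1.48kb.98kb, row d (50001) appears only in chr1.48kb.98kb, row e (96000) appears in both chr1.48kb.98kb and chr1.96kb.146kb, and the chromosome 2 row is skipped.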

How it works

  • $1=="1"{

    This selects all lines whose first field is 1. (You didn't mention this in the text, but your code applies this restriction.)

  • n=int($2/48000)

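    This computes which bucket the line belongs in. Buckets are 50,000 wide and overlap by 2,000, so a new bucket starts every 48,000, and bucket n starts at 48*n kb.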

  • print>("chr1." (48*n) "kb." (48*n+50) "kb")

    This writes the line to the file for bucket n, whose name is built from the bucket's start (48*n kb) and end (48*n+50 kb).

  • n--

    This decrements the bucket number, so n now refers to the previous bucket.

  • if (n>=0 && $2/1000<=48*n+50) print>("chr1." (48*n) "kb." (48*n+50) "kb")

    If this line also fits within the overlapping range of the previous bucket, that is, if $2 is no greater than that bucket's upper bound of 48*n+50 kb, then write it to that bucket as well.

  • }

    This closes the group started by selecting $1=="1".
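One practical caveat: with positions up to about 250 million and a new bucket every 48,000, the command can create roughly 5,200 output files, and some awk implementations limit how many files can be open at once. The sketch below is one possible variation, not part of the original answer. It assumes the input is sorted by position within the chromosome (as the sample rows appear to be) and closes chunk files that can no longer receive lines. The helper function name, the variable last, and the file sorted.freq are hypothetical names introduced for this example.

```shell
# Hypothetical sample, already sorted by position within chromosome 1.
printf '%s\n' '1 10492 a' '1 48500 b' '1 50001 d' '1 96000 e' '1 400000 g' > sorted.freq

awk '
# Build the chunk file name for bucket k (same naming scheme as above).
function name(k) { return "chr1." (48*k) "kb." (48*k+50) "kb" }
$1=="1"{
    n=int($2/48000)
    if (n != last) {
        # Moved on to a new bucket: the bucket before the old one is finished.
        close(name(last-1))
        # If at least one bucket was skipped, the old bucket is finished too.
        if (n != last+1) close(name(last))
        last = n
    }
    print > (name(n))
    # Also write to the previous bucket when the position is in the overlap.
    if (n>=1 && $2/1000 <= 48*(n-1)+50) print > (name(n-1))
}' sorted.freq
```

With sorted input, at most two chunk files are open at any time. If the input is not sorted, a closed file would be reopened with > and truncated, so this variation should only be used when the sort order is known.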

Upvotes: 2
