bapors

Reputation: 909

Sorting a file and splitting it into different files

I am trying to sort a file that contains different genomic regions, where each region has its own letter-and-number identifier.

I want to group the rows of the file by genomic location (columns 1, 2 and 3) and, when all three values are the same, write those rows to a separate new file.

My input is:

1.txt
chr1    10  20 . . 00000 ACTGBACA
chr1    10  20 . + 11111 AACCCCHQ
chr1    18  40 . . 0 AA12KCCHQ
chr7    22  23 . . 21 KLJMWQKD
chr7    22  23 . . 8 XJKFIRHFBF24
chrX    199 201 . . KK AVJI24

What I am expecting is:

chr1.10-20.txt
chr1    10  20 ACTGBACA
chr1    10  20 AACCCCHQ


chr1.18-40.txt
chr1    18  40 AA12KCCHQ

chr7.22-23.txt
chr7    22  23 KLJMWQKD
chr7    22  23 XJKFIRHFBF24

chrX.199-201.txt
chrX    199 201 AVJI24

I was experimenting with splitting the file using awk, but the result is not what I want.

awk -F, '{print > $1$2$3".txt"}' 1.txt

It uses each entire row as a file name, and each file again contains the whole row, even though I need only columns 1, 2, 3 and 7.

>ls
1.txt                                  
chr1    10  20 . + 11111 AACCCCHQ.txt  
chr7    22  23 . . 21 KLJMWQKD.txt     
chrX    199 201 . . KK AVJI24.txt  
chr1    10  20 . . 00000 ACTGBACA.txt  
chr1    18  40 . . 0 AA12KCCHQ.txt     
chr7    22  23 . . 8 XJKFIRHFBF24.txt   

>cat chr1\ \ \ \ 10\ \ 20\ .\ +\ 11111\ AACCCCHQ.txt 
chr1    10  20 . + 11111 AACCCCHQ

I would appreciate it if you could show me how to fix the file names and their contents.

Upvotes: 1

Views: 33

Answers (1)

John Moon

Reputation: 924

Take a look at this:

#!/bin/sh
INPUT="$1"

while read -r LINE; do
    # Build the file name from fields 1-3, e.g. "chr1.10-20"
    GEN_LOC="$(echo "$LINE" | tr -s ' ' | cut -d ' ' -f 1,2,3 | sed 's/ /./; s/ /-/')"
    # Append fields 1, 2, 3 and 7 to that region's file
    echo "$LINE" | tr -s ' ' | cut -d ' ' -f 1,2,3,7 >> "${GEN_LOC}.txt"
done < "$INPUT"

This script takes an input file in the format you posted and reads it line by line. For each line, it squeezes the repeated spaces, keeps fields 1, 2 and 3, and joins them as chr.start-end to form the file name (stored in the $GEN_LOC variable). It then appends fields 1, 2, 3 and 7 of $LINE, separated by single spaces, to a file named ${GEN_LOC}.txt. If multiple lines map to the same file name, that's fine: each line is simply appended. The script does not take previous runs into account, so if you run it twice, it will keep appending to the existing files. Hope this helps!
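The same split can also be sketched directly in awk, the tool used in the question. In the question's command, -F, sets the field separator to a comma, so with whitespace-separated input $1 holds the whole row and $2 and $3 are empty, which explains the file names you got. The version below is only a sketch: it assumes fields are separated by spaces or tabs, and split_regions is a hypothetical helper name, not something from the original post.

```shell
#!/bin/sh
# split_regions FILE
# Hypothetical helper: writes columns 1, 2, 3 and 7 of each row to a file
# named <col1>.<col2>-<col3>.txt in the current directory.
split_regions() {
    awk '{
        # Default field splitting handles runs of spaces or tabs.
        out = $1 "." $2 "-" $3 ".txt"
        # ">>" appends, so rows sharing a location end up in the same file.
        print $1, $2, $3, $7 >> out
        # Close after each row to avoid running out of open file handles.
        close(out)
    }' "$1"
}
```

Calling split_regions 1.txt in an empty working directory should produce one file per location. As with the loop above, rerunning it appends to any files that already exist, so remove old output first if you want a fresh result.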

Upvotes: 1
