Reputation: 909
I am trying to sort a file that contains different genomic regions, where each region has its own letter-and-number combination.
I want to group the whole file by genomic location (columns 1, 2, and 3): rows in which these three columns are the same should be extracted together into a new, separate file.
My input is:
1.txt
chr1 10 20 . . 00000 ACTGBACA
chr1 10 20 . + 11111 AACCCCHQ
chr1 18 40 . . 0 AA12KCCHQ
chr7 22 23 . . 21 KLJMWQKD
chr7 22 23 . . 8 XJKFIRHFBF24
chrX 199 201 . . KK AVJI24
What I am expecting is:
chr1.10-20.txt
chr1 10 20 ACTGBACA
chr1 10 20 AACCCCHQ
chr1.18-40.txt
chr1 18 40 AA12KCCHQ
chr7.22-23.txt
chr7 22 23 KLJMWQKD
chr7 22 23 XJKFIRHFBF24
chrX.199-201.txt
chrX 199 201 AVJI24
I experimented with splitting the file using awk, but the result is not what I want:
awk -F, '{print > $1$2$3".txt"}' 1.txt
It names each output file after the entire row, and each file again contains the whole row, even though I need only columns 1, 2, 3, and 7.
>ls
1.txt
chr1 10 20 . + 11111 AACCCCHQ.txt
chr7 22 23 . . 21 KLJMWQKD.txt
chrX 199 201 . . KK AVJI24.txt
chr1 10 20 . . 00000 ACTGBACA.txt
chr1 18 40 . . 0 AA12KCCHQ.txt
chr7 22 23 . . 8 XJKFIRHFBF24.txt
>cat chr1\ \ \ \ 10\ \ 20\ .\ +\ 11111\ AACCCCHQ.txt
chr1 10 20 . + 11111 AACCCCHQ
I would appreciate it if you could show me how to fix the file names and their contents.
Upvotes: 1
Views: 33
Reputation: 924
Take a look at this:
#!/bin/sh
INPUT="$1"
while read -r LINE; do
    # Build the file name from fields 1-3, e.g. "chr1.10-20"
    GEN_LOC="$(echo "$LINE" | tr -s ' ' | cut -d ' ' -f 1,2,3 | sed 's/ /./; s/ /-/')"
    # Append only fields 1, 2, 3, and 7 to the matching file
    echo "$LINE" | tr -s ' ' | cut -d ' ' -f 1,2,3,7 >> "${GEN_LOC}.txt"
done < "$INPUT"
This script takes an input file in the format you posted and reads it line by line. For each line, it squeezes the repeated spaces, keeps fields 1, 2, and 3, and joins them with a dot and a dash to form the file name (stored in the $GEN_LOC variable), for example chr1.10-20. Then it appends fields 1, 2, 3, and 7 of $LINE to a file named ${GEN_LOC}.txt. If several lines map to the same file name, that's fine: each line is simply appended. The script does not account for previous runs, so if you run it twice, the same lines are appended to the existing files again. Hope this helps!
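As a side note, the awk attempt in the question fails because -F, sets the field separator to a comma; the rows contain no commas, so the whole row ends up in $1. A minimal awk-only sketch of the same split, assuming the columns are separated by spaces or tabs and that the sequence is always in column 7, could look like this. The scratch directory and the sample file are only there to make the example self-contained:

```shell
#!/bin/sh
# Work in a scratch directory so the sketch does not touch real data.
workdir="$(mktemp -d)"
cd "$workdir" || exit 1

# Sample input copied from the question (whitespace-separated columns).
cat > 1.txt <<'EOF'
chr1    10  20 . . 00000 ACTGBACA
chr1    10  20 . + 11111 AACCCCHQ
chr1 18 40 . . 0 AA12KCCHQ
chr7 22 23 . . 21 KLJMWQKD
chr7 22 23 . . 8 XJKFIRHFBF24
chrX 199 201 . . KK AVJI24
EOF

# Default field splitting (no -F) treats any run of spaces or tabs as one separator.
awk '{
    out = $1 "." $2 "-" $3 ".txt"     # e.g. chr1.10-20.txt
    print $1, $2, $3, $7 >> out       # keep only columns 1, 2, 3 and 7
    close(out)                        # keep the number of open files small on large inputs
}' 1.txt

cat chr1.10-20.txt
# chr1 10 20 ACTGBACA
# chr1 10 20 AACCCCHQ
```

Like the shell loop, this appends with >>, so output files left over from an earlier run should be removed before running it again.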
Upvotes: 1