Reputation: 421
I have a file with data in the following format:
$ cat sample.txt
>MA0002.1 RUNX1
A [ 10 12 4 1 2 2 0 0 0 8 13 ]
C [ 2 2 7 1 0 8 0 0 1 2 2 ]
G [ 3 1 1 0 23 0 26 26 0 0 4 ]
T [ 11 11 14 24 1 16 0 0 25 16 7 ]
>MA0003.1 TFAP2A
A [ 0 0 0 22 19 55 53 19 9 ]
C [ 0 185 185 71 57 44 30 16 78 ]
G [ 185 0 0 46 61 67 91 137 79 ]
T [ 0 0 0 46 48 19 11 13 19 ]
>MA0003.3 TFAP2C
A [ 1706 137 0 0 33 575 3640 1012 0 31 1865 ]
C [ 1939 968 5309 5309 1646 2682 995 224 31 4726 798 ]
G [ 277 4340 139 11 658 1613 618 5309 5309 582 1295 ]
T [ 1386 47 0 281 2972 438 56 0 0 21 1350 ]
I want to split this file into individual files at each line that starts with >, and I know that such a line appears every 5 lines, starting with the first. I can do this with:
awk 'NR%5==1{x="F"++i;}{print > x}' sample.txt
The problem is that, although it splits the file correctly, the output files are named F1, F2 and F3, without any extension. I want each file to be named after the name in its first line (RUNX1, TFAP2A and TFAP2C here), with the extension .pfm.
The final files should look like this:
$ cat RUNX1.pfm
>MA0002.1 RUNX1
A [ 10 12 4 1 2 2 0 0 0 8 13 ]
C [ 2 2 7 1 0 8 0 0 1 2 2 ]
G [ 3 1 1 0 23 0 26 26 0 0 4 ]
T [ 11 11 14 24 1 16 0 0 25 16 7 ]
$ cat TFAP2A.pfm
>MA0003.1 TFAP2A
A [ 0 0 0 22 19 55 53 19 9 ]
C [ 0 185 185 71 57 44 30 16 78 ]
G [ 185 0 0 46 61 67 91 137 79 ]
T [ 0 0 0 46 48 19 11 13 19 ]
and so on.
Thank you for taking the time to help me!
Upvotes: 2
Views: 1259
Reputation: 58371
This might work for you (GNU sed & csplit):
csplit -z file '/^>/' '{*}'
sed -ns '1F;1s/^\S\+\s*//p' xx* | sed 'N;s/\n/ /;s/$/.pfm/;s/^/mv -v /e'
Use csplit to split the file on the pattern ^>, i.e. a > at the start of a line marks the start of a new file. Then use two invocations of sed to rename the pieces. The first prints each piece's current file name followed by its intended name (what remains of its first line after the ID). The second joins each pair onto one line, appends the .pfm extension, then prepends mv -v and executes the resulting command. Place the files in a separate directory and use head * to check the results.
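If the sed `F` command or the `e` flag is unfamiliar, the rename step can also be written as a plain shell loop. The following is only a sketch under some assumptions: GNU csplit is available, the names in the header lines are unique and contain no spaces or slashes, and the scratch directory and two-record sample are made up for the demonstration.

```shell
# Work in a scratch directory so the xx* pieces cannot clash with other files.
tmp=$(mktemp -d) && cd "$tmp" || exit 1

# Two-record sample in the question's format.
cat > sample.txt <<'EOF'
>MA0002.1 RUNX1
A [ 10 12 4 1 2 2 0 0 0 8 13 ]
C [ 2 2 7 1 0 8 0 0 1 2 2 ]
G [ 3 1 1 0 23 0 26 26 0 0 4 ]
T [ 11 11 14 24 1 16 0 0 25 16 7 ]
>MA0003.1 TFAP2A
A [ 0 0 0 22 19 55 53 19 9 ]
C [ 0 185 185 71 57 44 30 16 78 ]
G [ 185 0 0 46 61 67 91 137 79 ]
T [ 0 0 0 46 48 19 11 13 19 ]
EOF

# GNU csplit: start a new piece at every line beginning with ">".
# -z drops the empty piece before the first match, -s silences the byte counts.
csplit -s -z sample.txt '/^>/' '{*}'

# Rename each piece after the second field of its first line.
for piece in xx*; do
    name=$(awk 'NR==1 { print $2; exit }' "$piece")
    mv -- "$piece" "$name.pfm"
done
```

Because the loop quotes its variables, it avoids building a command line as text, at the cost of starting one awk process per piece.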
Upvotes: 1
Reputation: 16997
The following also handles the case where a name is used more than once:
One-liner:
awk '/>/{f=$2 (a[$2]++?"."a[$2]-1:"") ".pfm"; if(f!=p){ close(p); p=f}}{print >f}' file
More readable:
awk '/>/{
f=$2 (a[$2]++?"."a[$2]-1:"") ".pfm";
if(f!=p){
close(p);
p=f
}
}
{
print >f
}
' file
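To see how the naming rule behaves without writing any files, the same expression can be made to print the names it would choose. This is a small sketch: the header lines are made-up test input, the array is called `seen` here for readability, and the subtraction is parenthesized only to make the precedence explicit.

```shell
# Print only the file names the naming rule would produce, without writing files.
# The header lines below are made-up test input, not real motif IDs.
names=$(printf '%s\n' '>id1 RUNX1' '>id2 TFAP2C' '>id3 TFAP2A' '>id4 TFAP2C' |
    awk '/^>/ {
        # seen[$2]++ yields 0 (false) the first time a name appears,
        # so the first file gets no suffix; later ones get .1, .2, ...
        f = $2 (seen[$2]++ ? "." (seen[$2] - 1) : "") ".pfm"
        print f
    }')
printf '%s\n' "$names"
# prints:
# RUNX1.pfm
# TFAP2C.pfm
# TFAP2A.pfm
# TFAP2C.1.pfm
```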
Input:
$ cat file
>MA0002.1 RUNX1
A [ 10 12 4 1 2 2 0 0 0 8 13 ]
C [ 2 2 7 1 0 8 0 0 1 2 2 ]
G [ 3 1 1 0 23 0 26 26 0 0 4 ]
T [ 11 11 14 24 1 16 0 0 25 16 7 ]
>MA0003.3 TFAP2C
A [ 1706 137 0 0 33 575 3640 1012 0 31 1865 ]
C [ 1939 968 5309 5309 1646 2682 995 224 31 4726 798 ]
G [ 277 4340 139 11 658 1613 618 5309 5309 582 1295 ]
T [ 1386 47 0 281 2972 438 56 0 0 21 1350 ]
>MA0003.1 TFAP2A
A [ 0 0 0 22 19 55 53 19 9 ]
C [ 0 185 185 71 57 44 30 16 78 ]
G [ 185 0 0 46 61 67 91 137 79 ]
T [ 0 0 0 46 48 19 11 13 19 ]
>MA0003.3 TFAP2C
A [ 1706 137 0 0 33 575 3640 1012 0 31 1865 ]
C [ 1939 968 5309 5309 1646 2682 995 224 31 4726 798 ]
G [ 277 4340 139 11 658 1613 618 5309 5309 582 1295 ]
T [ 1386 47 0 281 2972 438 56 0 0 21 1350 ]
Execution:
$ awk '/>/{f=$2 (a[$2]++?"."a[$2]-1:"") ".pfm"; if(f!=p){ close(p); p=f}}{print >f}' file
Output files:
$ ls *.pfm -1
RUNX1.pfm
TFAP2A.pfm
TFAP2C.1.pfm
TFAP2C.pfm
Contents of each file:
$ for i in *.pfm; do echo "Output File:$i"; cat "$i"; done
Output File:RUNX1.pfm
>MA0002.1 RUNX1
A [ 10 12 4 1 2 2 0 0 0 8 13 ]
C [ 2 2 7 1 0 8 0 0 1 2 2 ]
G [ 3 1 1 0 23 0 26 26 0 0 4 ]
T [ 11 11 14 24 1 16 0 0 25 16 7 ]
Output File:TFAP2A.pfm
>MA0003.1 TFAP2A
A [ 0 0 0 22 19 55 53 19 9 ]
C [ 0 185 185 71 57 44 30 16 78 ]
G [ 185 0 0 46 61 67 91 137 79 ]
T [ 0 0 0 46 48 19 11 13 19 ]
Output File:TFAP2C.1.pfm
>MA0003.3 TFAP2C
A [ 1706 137 0 0 33 575 3640 1012 0 31 1865 ]
C [ 1939 968 5309 5309 1646 2682 995 224 31 4726 798 ]
G [ 277 4340 139 11 658 1613 618 5309 5309 582 1295 ]
T [ 1386 47 0 281 2972 438 56 0 0 21 1350 ]
Output File:TFAP2C.pfm
>MA0003.3 TFAP2C
A [ 1706 137 0 0 33 575 3640 1012 0 31 1865 ]
C [ 1939 968 5309 5309 1646 2682 995 224 31 4726 798 ]
G [ 277 4340 139 11 658 1613 618 5309 5309 582 1295 ]
T [ 1386 47 0 281 2972 438 56 0 0 21 1350 ]
Upvotes: 1
Reputation: 133428
The following awk may help you with this.
awk '/^>/{if(file){close(file)};file=$2".pfm"} {print > file}' Input_file
Here is a non-one-liner form with an explanation.
awk '
/^>/{ ##Check whether the line starts with ">"; if it does, do the following.
if(file){ ##If the variable named file is NOT empty, an output file is already open, so do the following.
close(file) ##close is a built-in awk function which closes an opened file, so that we avoid having too many files open at a time.
};
file=$2".pfm" ##Set the variable named file to the 2nd field of the line which starts with ">", followed by the .pfm extension.
}
{
print > file ##Print the current line to the output file whose name (the 2nd field plus .pfm) is held in the variable file.
}
' Input_file ##The Input_file name goes here.
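EDIT: To avoid overwriting a file when a name occurs more than once (later occurrences are written to NAME.2.pfm, NAME.3.pfm, and so on):
awk '/^>/{if(file){close(file)};array[$2]++;file=(array[$2]==1?$2:$2"."array[$2])".pfm"} {print > file}' Input_file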
Upvotes: 2
Reputation: 3079
Set the record separator to > so that each motif becomes one record:
awk -v RS=">" -v ORS="" 'NR>1{file=$2".pfm"; print RS $0 > file; close(file)}' file
To write a new file instead of overwriting when a file with the same name has already been saved, use this one:
awk -v RS=">" -v ORS="" 'NR>1{a[$2]++; if(a[$2]>1) file=$2"."a[$2]; else file=$2; file=file".pfm"; print RS $0 > file; close(file)}' file
For example, if TFAP2A.pfm was saved before, later records with that name will be saved as TFAP2A.2.pfm, TFAP2A.3.pfm, and so on.
Or, if you want every file to carry a version number (e.g. abc.1.pfm, abc.2.pfm), simply:
awk -v RS=">" -v ORS="" 'NR>1{file=$2"."(++a[$2])".pfm"; print RS $0 > file; close(file)}' file
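One detail of splitting on RS=">" is worth checking on your own data: when the input begins with >, awk sees an empty record before the first separator, which is why a guard such as NR>1 is useful (otherwise something gets written to a file named just .pfm). Each record also keeps its trailing newline, so print with the default ORS adds a blank line. The sketch below only prints what awk sees; the two-record input is made up.

```shell
# Made-up two-record input; only the ">" layout matters here.
records=$(printf '>id1 X\nrow\n>id2 Y\nrow\n' |
    awk -v RS='>' '{ printf "record %d: NF=%d name=[%s]\n", NR, NF, $2 }')
printf '%s\n' "$records"
# prints:
# record 1: NF=0 name=[]
# record 2: NF=3 name=[X]
# record 3: NF=3 name=[Y]
```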
Upvotes: 2
Reputation: 92854
awk approach:
awk 'NR%5==1{ fn=$2".pfm" }fn{ print > fn}' file
Or the same, keyed on the > marker:
awk '/^>/{ fn=$2".pfm" }fn{ print > fn}' file
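A quick way to check that the split lost nothing is to concatenate the output files in input order and compare the result with the original. This is a sketch that assumes every name in the input is unique and free of whitespace (otherwise later records overwrite earlier ones and the comparison fails); the scratch directory and the two-record sample are made up for the demonstration.

```shell
# Work in a scratch directory with a small sample in the question's format.
tmp=$(mktemp -d) && cd "$tmp" || exit 1
cat > file <<'EOF'
>MA0002.1 RUNX1
A [ 10 12 4 1 2 2 0 0 0 8 13 ]
C [ 2 2 7 1 0 8 0 0 1 2 2 ]
G [ 3 1 1 0 23 0 26 26 0 0 4 ]
T [ 11 11 14 24 1 16 0 0 25 16 7 ]
>MA0003.1 TFAP2A
A [ 0 0 0 22 19 55 53 19 9 ]
C [ 0 185 185 71 57 44 30 16 78 ]
G [ 185 0 0 46 61 67 91 137 79 ]
T [ 0 0 0 46 48 19 11 13 19 ]
EOF

# Split on the ">" marker, as in the command above.
awk '/^>/{ fn=$2".pfm" } fn{ print > fn }' file

# List the output names in input order, concatenate them, and compare
# the result byte for byte with the original file.
if awk '/^>/{ print $2".pfm" }' file | xargs cat | cmp -s - file; then
    echo "split is lossless"
else
    echo "outputs differ from the input" >&2
fi
```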
Upvotes: 1