Newbie

Reputation: 421

awk to divide an individual file into multiple files with particular file names

I have a file containing data in the following format:

$ cat sample.txt
>MA0002.1   RUNX1
A  [    10     12      4      1      2      2      0      0      0      8     13 ]
C  [     2      2      7      1      0      8      0      0      1      2      2 ]
G  [     3      1      1      0     23      0     26     26      0      0      4 ]
T  [    11     11     14     24      1     16      0      0     25     16      7 ]
>MA0003.1   TFAP2A
A  [     0      0      0     22     19     55     53     19      9 ]
C  [     0    185    185     71     57     44     30     16     78 ]
G  [   185      0      0     46     61     67     91    137     79 ]
T  [     0      0      0     46     48     19     11     13     19 ]
>MA0003.3   TFAP2C
A  [  1706    137      0      0     33    575   3640   1012      0     31   1865 ]
C  [  1939    968   5309   5309   1646   2682    995    224     31   4726    798 ]
G  [   277   4340    139     11    658   1613    618   5309   5309    582   1295 ]
T  [  1386     47      0    281   2972    438     56      0      0     21   1350 ]

I want to split this file into individual files at each > character, which I know starts a new record every 5 lines (lines 1, 6, 11, ...). I can do this with:

awk 'NR%5==1{x="F"++i;}{print > x}' sample.txt

The problem is that, although it splits the file correctly, the output files are named F1, F2 and F3, without any extension. I want each file to be named after the name given in its first line (RUNX1, TFAP2A and TFAP2C) with a .pfm extension.

The final files should look like this:

$ cat RUNX1.pfm
>MA0002.1   RUNX1
A  [    10     12      4      1      2      2      0      0      0      8     13 ]
C  [     2      2      7      1      0      8      0      0      1      2      2 ]
G  [     3      1      1      0     23      0     26     26      0      0      4 ]
T  [    11     11     14     24      1     16      0      0     25     16      7 ]

$ cat TFAP2A.pfm
>MA0003.1   TFAP2A
A  [     0      0      0     22     19     55     53     19      9 ]
C  [     0    185    185     71     57     44     30     16     78 ]
G  [   185      0      0     46     61     67     91    137     79 ]
T  [     0      0      0     46     48     19     11     13     19 ]

and so on..

Thank you for taking the time to help me!

Upvotes: 2

Views: 1259

Answers (5)

potong

Reputation: 58371

This might work for you (GNU sed & csplit):

csplit -z file '/^>/' '{*}'
sed -ns '1F;1s/^\S\+\s*\(.*\)$/\1.pfm/p' xx* | sed 'N;s/\n/ /;s/^/mv -v /e'

csplit does the splitting, using the pattern ^> (a > at the start of a line marks the start of a new file). Two invocations of sed then rename the pieces: the first prints each piece's current file name followed by its intended name, and the second joins each pair of lines, prepends mv -v and executes the resulting command. Run this in a separate directory and use head * to check the results.
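
Where GNU sed's e flag is not available, the renaming step can also be written as an ordinary shell loop. The following is a minimal sketch rather than the method above: it assumes GNU csplit (for -z), assumes that the second field of each header is safe to use as a file name, and works on a small illustrative sample.txt inside a scratch directory (the variable names workdir, piece and name are made up for this example).

```shell
#!/bin/sh
# Work in a scratch directory so the xx* pieces cannot collide with other files.
workdir=$(mktemp -d) && cd "$workdir" || exit 1

# Small two-record sample in the format from the question (illustrative values).
cat > sample.txt <<'EOF'
>MA0002.1   RUNX1
A  [ 10 12 ]
C  [  2  2 ]
G  [  3  1 ]
T  [ 11 11 ]
>MA0003.1   TFAP2A
A  [  0   0 ]
C  [  0 185 ]
G  [ 185  0 ]
T  [  0   0 ]
EOF

# Split at every line starting with ">"; -z drops the empty piece
# that would otherwise be created before the first header, -s is quiet.
csplit -s -z sample.txt '/^>/' '{*}'

# Rename each piece after the second field of its first line.
for piece in xx*; do
    name=$(awk 'NR==1 { print $2; exit }' "$piece")
    mv -- "$piece" "$name.pfm"
done

ls
```

The loop trades the compactness of the sed pipeline for something that is easier to adapt, for instance to skip pieces whose header has no second field.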

Upvotes: 1

Akshay Hegde

Reputation: 16997

The one below also handles the case where a name is used more than once.

One-liner:

awk '/>/{f=$2 (a[$2]++?"."a[$2]-1:"") ".pfm"; if(f!=p){ close(p); p=f}}{print >f}' file

More readable:

awk '/>/{
        f=$2 (a[$2]++?"."a[$2]-1:"") ".pfm"
        if(f!=p){
            close(p)
            p=f
        }
     }
     {
        print >f
     }
    ' file

Input:

$ cat file
>MA0002.1   RUNX1
A  [    10     12      4      1      2      2      0      0      0      8     13 ]
C  [     2      2      7      1      0      8      0      0      1      2      2 ]
G  [     3      1      1      0     23      0     26     26      0      0      4 ]
T  [    11     11     14     24      1     16      0      0     25     16      7 ]
>MA0003.3   TFAP2C
A  [  1706    137      0      0     33    575   3640   1012      0     31   1865 ]
C  [  1939    968   5309   5309   1646   2682    995    224     31   4726    798 ]
G  [   277   4340    139     11    658   1613    618   5309   5309    582   1295 ]
T  [  1386     47      0    281   2972    438     56      0      0     21   1350 ]
>MA0003.1   TFAP2A
A  [     0      0      0     22     19     55     53     19      9 ]
C  [     0    185    185     71     57     44     30     16     78 ]
G  [   185      0      0     46     61     67     91    137     79 ]
T  [     0      0      0     46     48     19     11     13     19 ]
>MA0003.3   TFAP2C
A  [  1706    137      0      0     33    575   3640   1012      0     31   1865 ]
C  [  1939    968   5309   5309   1646   2682    995    224     31   4726    798 ]
G  [   277   4340    139     11    658   1613    618   5309   5309    582   1295 ]
T  [  1386     47      0    281   2972    438     56      0      0     21   1350 ]

Execution:

$ awk '/>/{f=$2 (a[$2]++?"."a[$2]-1:"") ".pfm"; if(f!=p){ close(p); p=f}}{print >f}' file

Output files:

$ ls *.pfm -1
RUNX1.pfm
TFAP2A.pfm
TFAP2C.1.pfm
TFAP2C.pfm

Contents of each file:

$ for i in *.pfm; do echo "Output File:$i"; cat "$i"; done
Output File:RUNX1.pfm
>MA0002.1   RUNX1
A  [    10     12      4      1      2      2      0      0      0      8     13 ]
C  [     2      2      7      1      0      8      0      0      1      2      2 ]
G  [     3      1      1      0     23      0     26     26      0      0      4 ]
T  [    11     11     14     24      1     16      0      0     25     16      7 ]
Output File:TFAP2A.pfm
>MA0003.1   TFAP2A
A  [     0      0      0     22     19     55     53     19      9 ]
C  [     0    185    185     71     57     44     30     16     78 ]
G  [   185      0      0     46     61     67     91    137     79 ]
T  [     0      0      0     46     48     19     11     13     19 ]
Output File:TFAP2C.1.pfm
>MA0003.3   TFAP2C
A  [  1706    137      0      0     33    575   3640   1012      0     31   1865 ]
C  [  1939    968   5309   5309   1646   2682    995    224     31   4726    798 ]
G  [   277   4340    139     11    658   1613    618   5309   5309    582   1295 ]
T  [  1386     47      0    281   2972    438     56      0      0     21   1350 ]

Output File:TFAP2C.pfm
>MA0003.3   TFAP2C
A  [  1706    137      0      0     33    575   3640   1012      0     31   1865 ]
C  [  1939    968   5309   5309   1646   2682    995    224     31   4726    798 ]
G  [   277   4340    139     11    658   1613    618   5309   5309    582   1295 ]
T  [  1386     47      0    281   2972    438     56      0      0     21   1350 ]

Upvotes: 1

RavinderSingh13

Reputation: 133428

The following awk may help you with this.

awk '/^>/{if(file){close(file)};file=$2".pfm"} {print > file}'  Input_file

Here is a multi-line form with an explanation:

awk '
/^>/{               ##If the line starts with ">", do the following.
  if(file){         ##If the variable file is already set, an output file is open.
    close(file)     ##close() is a built-in awk function; closing the previous file avoids having too many files open at once.
  }
  file=$2".pfm"     ##Set file to the 2nd field of the ">" line plus the .pfm extension.
}
{
  print > file      ##Print the current line to the output file named in the variable file.
}
' Input_file        ##Mentioning the Input_file name here.

EDIT: to cope with a name that occurs more than once (later copies get a numeric suffix):

awk '/^>/{if(file){close(file)};array[$2]++;file=(array[$2]==1?$2:$2"."array[$2])".pfm"} {print > file}'  Input_file
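
The duplicate-name handling is easier to follow when spelled out over several lines. This is a sketch under stated assumptions, not the exact command above: the variable names out, seen and n are made up for the example, the three short records are illustrative data, and the files are written to a scratch directory.

```shell
#!/bin/sh
workdir=$(mktemp -d) && cd "$workdir" || exit 1

# Illustrative input in which the name TFAP2C occurs twice.
cat > Input_file <<'EOF'
>MA0003.3   TFAP2C
A  [ 1 2 ]
>MA0003.1   TFAP2A
A  [ 3 4 ]
>MA0003.3   TFAP2C
A  [ 5 6 ]
EOF

awk '
/^>/ {
    if (out) close(out)                     # release the previous output file
    n = ++seen[$2]                          # occurrences of this name so far
    out = (n == 1 ? $2 : $2 "." n) ".pfm"   # TFAP2C.pfm, then TFAP2C.2.pfm, ...
}
out { print > out }                         # ignore lines before the first header
' Input_file

ls
```

Building the complete file name once and using the same string for both print and close() keeps the two in step, so close() acts on the file that was actually opened.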

Upvotes: 2

Rahul Verma

Reputation: 3079

That's it

awk -v RS=">" 'NF{f=$2".pfm"; printf "%s%s", RS, $0 > f; close(f)}' file

To avoid overwriting a file when a name occurs more than once, use this one instead:

awk -v RS=">" 'NF{n=++a[$2]; f=(n>1 ? $2"."n : $2)".pfm"; printf "%s%s", RS, $0 > f; close(f)}' file

For example, if TFAP2A.pfm has already been written, later records with the same name are saved as TFAP2A.2.pfm, TFAP2A.3.pfm, and so on.

Or, if you want every file to carry a version number (e.g. abc.1.pfm, abc.2.pfm), simply:

awk -v RS=">" 'NF{f=$2"."(++a[$2])".pfm"; printf "%s%s", RS, $0 > f; close(f)}' file
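
With RS set to ">", awk also treats whatever precedes the first ">" as a record, and for this kind of input that record is empty. The sketch below only illustrates this: it feeds two made-up records to awk and prints the name each record would yield (records is a hypothetical variable used just to hold the output).

```shell
#!/bin/sh
# Show how awk splits the input when the record separator is ">".
# The sample records are illustrative, not the real matrices.
records=$(
    printf '>ID1   NAME1\nA [ 1 2 ]\n>ID2   NAME2\nA [ 3 4 ]\n' |
    awk -v RS=">" '{ printf "record %d: name=\"%s\"\n", NR, $2 }'
)
printf '%s\n' "$records"
```

The first record has no fields at all, so a guard such as NF in front of the action keeps awk from writing a stray file named just .pfm. Each real record also still ends with its own newline, which is why printing it with print appends an extra blank line.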

Upvotes: 2

RomanPerekhrest

Reputation: 92854

awk approach:

awk 'NR%5==1{ fn=$2".pfm" }fn{ print > fn}' file

Or the same, keyed on the > marker instead of the line number:

awk '/^>/{ fn=$2".pfm" }fn{ print > fn}' file
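
Both commands keep every output file open until awk exits, which may hit the open-file limit of some awk implementations when the input holds many records. A minimal sketch that closes each file before starting the next and then checks the result, assuming the names are unique; the sample data is a shortened illustration and the variables workdir, headers and outputs are made up for the example:

```shell
#!/bin/sh
workdir=$(mktemp -d) && cd "$workdir" || exit 1

# Shortened two-record sample in the format from the question.
cat > file <<'EOF'
>MA0002.1   RUNX1
A  [ 10 12 ]
C  [  2  2 ]
G  [  3  1 ]
T  [ 11 11 ]
>MA0003.1   TFAP2A
A  [  0   0 ]
C  [  0 185 ]
G  [ 185  0 ]
T  [  0   0 ]
EOF

awk '
/^>/ {
    if (fn) close(fn)     # close the file of the previous record
    fn = $2 ".pfm"        # name the next file after the second header field
}
fn { print > fn }         # skip any lines that precede the first header
' file

# Sanity check: with unique names, one output file per header line is expected.
headers=$(grep -c '^>' file)
outputs=$(ls *.pfm | wc -l)
echo "headers: $headers, files: $outputs"
```

Because each file is closed, a name that reappears later in the input would be reopened with > and overwrite the earlier file, so a mismatch between the two counts is a hint that names repeat.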

Upvotes: 1
