Paul

Reputation: 1117

Sort files based on content

I have around 1000 files from a phylogenetic analysis and each file looks something like this

File 1
   (((A:0.03550734102561460548,B:0.04004337325891465377):0.01263892787244691278,C:0.03773369182398536581):0.08345900687345568503,(D:0.04441859961888914438,((E:0.04707945363105774106,F:0.03769496882665739068):0.00478087012691866091,G:0.01269975716918288373):0.01263526019405349088):0.13087200352448438712,H:1.91169780510990117151):0.0;

File 12

((A:0.11176787864288327545,(B:0.18283029119402782747,C:0.12136417416322728413):0.02367730683755531543):0.21101090994668161849,(((F:0.06464548582830945134,E:0.06903977777526745796):0.01710921464740109560,G:0.01577242071367901746):0.00958883666063858192,D:0.03506359820882300193):0.47082738536589324729,H:2.94973933657097164840):0.0;

I want to read the content of each file and classify the files by pattern, meaning by file content. The numbers are branch lengths and will differ from file to file, so I would like to classify the files based only on the letters A to H. For instance, all the files that have the letters A to H arranged in the same order should go together into their own folder. For example:

For File 1, ignoring the numbers (branch lengths), the pattern would be:

   (((A:,B:),C:):,(D:,((E:,F:):,G:):):,H:):;

All the files that contain this pattern would go into one folder: File 1, File 5, File 6, File 10, ....

I know how to sort files based on a particular, known pattern using:

    grep -l -Z pattern files | xargs -0 mv -t target-directory --  

But I am not sure how to do it in this case, as I have no prior knowledge of the patterns.

Upvotes: 1

Views: 428

Answers (1)

karakfa

Reputation: 67507

You can extract the content pattern of each file and sort on it:

    $ for f in file{1..2};
         do printf "%s\t" "$f"; tr -d ' 0-9.' <"$f";
         done |
      sort -k2

    file1   (((A:,B:):,C:):,(D:,((E:,F:):,G:):):,H:):;
    file2   ((A:,(B:,C:):):,(((F:,E:):,G:):,D:):,H:):;

Files with the same pattern will end up on consecutive lines. This assumes you have one record per file.
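
If you want to go one step further and actually move the files into one folder per pattern, something along these lines might work. It is only a sketch under a few assumptions of mine: one record per file, branch lengths written as plain decimals (no `1e-05` style notation, which would leave stray characters behind), and a made-up `topology_` folder prefix. The function name `group_by_topology` is also just something I picked. Folder names are a `cksum` of the pattern because parentheses and colons are awkward in directory names.

```shell
# Sketch: move each tree file into a folder named after its topology.
# Assumptions: one Newick record per file, plain decimal branch lengths,
# and the hypothetical "topology_" prefix for the target folders.
group_by_topology() {
    for f in "$@"; do
        [ -f "$f" ] || continue
        # Drop whitespace, digits and dots so only the topology is left.
        pattern=$(tr -d ' \t\n0-9.' < "$f")
        # Hash the pattern to get a folder name that is safe to use.
        dir="topology_$(printf '%s' "$pattern" | cksum | cut -d' ' -f1)"
        mkdir -p "$dir"
        mv -- "$f" "$dir/"
    done
}
```

Call it as `group_by_topology file*` from the directory that holds the files. `cksum` is only a 32-bit CRC, so with very many distinct patterns you may prefer a stronger hash such as `sha256sum` if it is available. Like the one-liner above, this treats `(E:,F:)` and `(F:,E:)` as different patterns.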

Upvotes: 2
