MarineBioAlex

Reputation: 21

Parsing unique data and renaming files

I am trying to write a Perl script to rename several hundred files with different names, but I have not had any success. I first need to find the unique sample number in each file name and then rename the file to something more human-readable. Because the file names are not sequential, this is difficult.

Examples of file names (the number of interest comes right after the index sequence):

#                                   vv-- this number
lane8-s244-index--ATTACTCG-TATAGCCT-01_S244_L008_R1_001.fastq
lane8-s245-index--ATTACTCG-ATAGAGGC-02_S245_L008_R1_001.fastq
lane8-s246-index--TCCGGAGA-TATAGCCT-09_S246_L008_R1_001.fastq
lane8-s247-index--TCCGGAGA-ATAGAGGC-10_S247_L008_R1_001.fastq
lane8-s248-index--TCCGGAGA-CCTATCCT-11_S248_L008_R1_001.fastq
lane8-s249-index--TCCGGAGA-GGCTCTGA-12_S249_L008_R1_001.fastq
lane8-s250-index--TCCGGAGA-AGGCGAAG-13_S250_L008_R1_001.fastq
lane8-s251-index--TCCGGAGA-TAATCTTA-14_S251_L008_R1_001.fastq
lane7-s0007-index--ATTACTCG-TATAGCCT-193_S7_L007_R1_001.fastq
lane7-s0008-index--ATTACTCG-ATAGAGGC-105_S8_L007_R1_001.fastq
lane7-s0009-index--ATTACTCG-CCTATCCT-195_S9_L007_R1_001.fastq
lane7-s0010-index--ATTACTCG-GGCTCTGA-106_S10_L007_R1_001.fastq
lane7-s0011-index--ATTACTCG-AGGCGAAG-197_S11_L007_R1_001.fastq
lane7-s0096-index--AGCGATAG-CAGGACGT-287_S96_L007_R1_001.fastq

I have created a file called RENAMING_parse_data.sh that calls RENAMING_parse_data.pl.

The idea is that the script parses each file name to find the sample number in the middle of the name, then uses that unique ID to rename the file. But I don't think it is even entering the if block. Any ideas?

Here is the .sh file that calls the Perl script:

#!/bin/bash
#the first part is the program
#the second is the directory path
#the third and fourth items are the names of the output files
#./parse_data.pl /ACTF/Course/PATHTODIRECTORY Tabsummary.txt Strucsummary.txt
#WHERE ./parse_data.pl = name of the program
#WHERE /ACTF/Course/PATHTODIRECTORY = directory path where your files are saved; it is referred to as $dir_in = $ARGV[0] in the Perl script
#the new file you are creating with the extracted data is referred to as $indv_list = $ARGV[1]

./RENAMING_parse_data.pl ./Test/ FishList.txt

Here is the Perl script:

#!/usr/bin/perl
print (":)\n");
#Processing files in a directory
$dir_in = $ARGV[0];
$indv_list = $ARGV[1];

#open the directory to access the files; this is the folder where you have the files
opendir(DIR, $dir_in) || die ("Cannot open $dir_in");
@files = readdir(DIR);

#set all variables = 0 to avoid chaos
$j=0;

#open the output file and print the header line for the tab-delimited file
open(OUTFILETAB, ">", $indv_list);
print(OUTFILETAB "\t Fish ID", "\t");

#open each file 
foreach (@files){
       #reset all arrays to avoid chaos
        print("in loop [$j]");
        @acc_ID=(); 
       #find FISH name
        #EXAMPLE FISH NAMES: (length of fish name varies)
        #lane8-s251-index--TCCGGAGA-TAATCTTA-14_S251_L008_R1_001.fastq.gz
        #lane7-s0096-index--AGCGATAG-CAGGACGT-287_S96_L007_R1_001.final.fastq
         #NOTE: what is between () is the ID that is printed. NOTE that its length can vary from 2 to 3 digits depending on the sample #
        #Trials:
        #lane[0-9]{1}-[a-z]{1}[0-9]{4}-index--[A-Z]{8}[A-Z]{8}-([0-9]{3})[a-z]{1}[0-9]{2}_[A-Z]{1}[0-9]{3}_[a-z]{1}[0-9]{1}_[0-9]{3}.fastq
         #lane[0-9]{1}-[a-z]{1}[0-9]{4}-index--[A-Z]{8}[A-Z]{8}-([0-9]{3})*.fastq
          #lane*([0-9]{3})*.fastq
         #lane.*-([0-9]{2})_.*.fastq
         #lane.*-([0-9]{2})_*.fastq
         #lane[0-9]{1}-[a-z]{1}[0-9]{3}-index--[A-Z]{8}[A-Z]{8}-([0-9]{2})_[A-Z]{1}[0-9]{3}_L008_R1_001.fastq
        $string_FISH = @files;
        if ($string_FISH =~ /^lane[0-9]{1}-[a-z]{1}[0-9]{3}-index--[A-Z]{8}[A-Z]{8}-([0-9]{2})_[A-Z]{1}[0-9]{3}_L008_R1_001.fastq/){
            $FISH_ID =$1;
            @acc_ID[$j] = $FISH_ID;
            #print ("FISH. = |$FISH_ID[$j]| \n");
            rename($string_FISH, "FISH. = |$FISH_ID[$j]|");
            #print ($acc_ID[$j], "\n");
            print(OUTFILETAB "FISH. = |$FISH_ID[$j]| \n");      
        }
    $j= $j+1;
}

IDEAL END RESULT

In the end, I would like the script to take each file name, find the unique identifier, and rename the file

from:

lane8-s244-index--ATTACTCG-TATAGCCT-01_S244_L008_R1_001.fastq
lane7-s0007-index--ATTACTCG-TATAGCCT-193_S7_L007_R1_001.fastq

to:

Fish.01.fastq
Fish.193.fastq

Any ideas or suggestions on how to fix this, or on whether it needs to change completely, are greatly appreciated.

Upvotes: 2

Views: 109

Answers (1)

ikegami

Reputation: 385809

At the core of a Perl solution, you could use

s/^.*-(\d+)_[^-]+(?=\.fastq\z)/Fish.$1/sa

For example,

$ ls -1 *.fastq
lane8-s244-index--ATTACTCG-TATAGCCT-01_S244_L008_R1_001.fastq
lane8-s245-index--ATTACTCG-ATAGAGGC-02_S245_L008_R1_001.fastq
lane8-s246-index--TCCGGAGA-TATAGCCT-09_S246_L008_R1_001.fastq
lane8-s247-index--TCCGGAGA-ATAGAGGC-10_S247_L008_R1_001.fastq
lane8-s248-index--TCCGGAGA-CCTATCCT-11_S248_L008_R1_001.fastq
lane8-s249-index--TCCGGAGA-GGCTCTGA-12_S249_L008_R1_001.fastq

$ rename 's/^.*-(\d+)_[^-]+(?=\.fastq\z)/Fish.$1/sa' *.fastq

$ ls -1 *.fastq
Fish.01.fastq
Fish.02.fastq
Fish.09.fastq
Fish.10.fastq
Fish.11.fastq
Fish.12.fastq

(There are two similar tools named rename. This one is also known as prename.)
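
To see what the substitution does before touching any files, it can be applied to plain strings. The following is a minimal sketch using names from the question; new_name is a hypothetical helper introduced here for illustration, and the /a modifier needs Perl 5.14 or later.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sample names copied from the question. Nothing on disk is touched.
my @names = (
    'lane8-s244-index--ATTACTCG-TATAGCCT-01_S244_L008_R1_001.fastq',
    'lane7-s0007-index--ATTACTCG-TATAGCCT-193_S7_L007_R1_001.fastq',
    'lane8-s251-index--TCCGGAGA-TAATCTTA-14_S251_L008_R1_001.fastq.gz',
);

# new_name() is a hypothetical helper: it returns the name the
# substitution would produce, or the input unchanged if nothing matches.
sub new_name {
    my ($name) = @_;
    # Greedy .* runs to the last "-" that is followed by digits and "_";
    # the lookahead keeps the ".fastq" extension out of the match.
    ( my $new = $name ) =~ s/^.*-(\d+)_[^-]+(?=\.fastq\z)/Fish.$1/sa;
    return $new;
}

printf( "%s -> %s\n", $_, new_name($_) ) for @names;
```

The two-digit and three-digit sample numbers are both picked up (Fish.01.fastq and Fish.193.fastq), while the name ending in .fastq.gz is left unchanged because the pattern requires .fastq at the very end of the string.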


It's pretty simple to implement yourself:

#!/usr/bin/perl

use strict;
use warnings;

my $errors = 0;
for (@ARGV) {
   my $old = $_;

   s/^.*-(\d+)_[^-]+(?=\.fastq\z)/Fish.$1/sa;

   my $new = $_;

   next if $new eq $old;

   if ( -e $new ) {
      warn( "Can't rename \"$old\" to \"$new\": Already exists\n" );
      ++$errors;
   }
   elsif ( !rename( $old, $new ) ) {
      warn( "Can't rename \"$old\" to \"$new\": $!\n" );
      ++$errors;
   }
}

exit( !!$errors );

Provide the files to rename as arguments (e.g. using *.fastq from the shell).

$ ls -1 *.fastq
lane8-s244-index--ATTACTCG-TATAGCCT-01_S244_L008_R1_001.fastq
lane8-s245-index--ATTACTCG-ATAGAGGC-02_S245_L008_R1_001.fastq
lane8-s246-index--TCCGGAGA-TATAGCCT-09_S246_L008_R1_001.fastq
lane8-s247-index--TCCGGAGA-ATAGAGGC-10_S247_L008_R1_001.fastq
lane8-s248-index--TCCGGAGA-CCTATCCT-11_S248_L008_R1_001.fastq
lane8-s249-index--TCCGGAGA-GGCTCTGA-12_S249_L008_R1_001.fastq

$ ./a *.fastq

$ ls -1 *.fastq
Fish.01.fastq
Fish.02.fastq
Fish.09.fastq
Fish.10.fastq
Fish.11.fastq
Fish.12.fastq
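
One thing to keep in mind if the arguments include a directory part (for example ./Test/*.fastq, as in the question's setup): because the pattern starts with ^.* and uses /s, the directory prefix is matched too, so the new name would end up relative to the current directory. A minimal sketch of one way around that, assuming it is acceptable to rewrite only the file-name part, is below. target_path is a hypothetical helper, and the loop only prints what it would do.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Basename qw( fileparse );

# target_path() is a hypothetical helper: it applies the substitution to
# the file-name part only and keeps the directory part as given.
sub target_path {
    my ($path) = @_;
    my ( $name, $dir ) = fileparse($path);   # e.g. ("x.fastq", "./Test/")
    $name =~ s/^.*-(\d+)_[^-]+(?=\.fastq\z)/Fish.$1/sa;
    return $dir . $name;
}

# Dry run: report the planned renames without calling rename().
for my $old (@ARGV) {
    my $new = target_path($old);
    next if $new eq $old;
    print("$old -> $new\n");
}
```

Note that fileparse returns ./ as the directory for a bare file name, so a path without a directory comes back with a leading ./ in front of the new name.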

The existence check (-e) is to prevent accidentally renaming a bunch of files to the same name and therefore losing all but one of them.
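
Collisions can also be spotted before anything is renamed. The sketch below groups the input names by the name they would be given and flags any target that has more than one source. plan_renames is a hypothetical helper, and the R2 file used in the usage note is an assumption for illustration; the question only lists R1 files.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# plan_renames() is a hypothetical helper: it maps each target name to the
# list of source names that would be renamed to it. No files are touched.
sub plan_renames {
    my @old_names = @_;
    my %sources_for;
    for my $old (@old_names) {
        ( my $new = $old ) =~ s/^.*-(\d+)_[^-]+(?=\.fastq\z)/Fish.$1/sa;
        next if $new eq $old;    # pattern did not match; leave it alone
        push( @{ $sources_for{$new} }, $old );
    }
    return \%sources_for;
}

my $plan = plan_renames(@ARGV);
for my $new ( sort keys %$plan ) {
    my @sources = @{ $plan->{$new} };
    if ( @sources > 1 ) {
        print("COLLISION $new <- @sources\n");
    }
    else {
        print("$sources[0] -> $new\n");
    }
}
```

For instance, if a directory held both an _R1_ and an _R2_ file for sample 01, both would map to Fish.01.fastq and would be reported as a collision rather than one overwriting the other.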


The above is a cleaned-up version of a one-liner pattern I often use.

dir /b ... | perl -nle"$o=$_; s/.../.../; $n=$_; rename$o,$n if!-e$n"

Adapted to sh:

\ls ... | perl -nle'$o=$_; s/.../.../; $n=$_; rename$o,$n if!-e$n'

Upvotes: 3
