sunta3iouxos

Reputation: 61

Rename files giving an increased numeric value based on previous occurrences, using bash

The answer that worked for me and provided the most flexibility is by @M.NejatAydin:

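#!/bin/bash
# $1: folder with the original files, $2: output folder

FQPATH=$1
OUTPATH=$2
rm -f -- "${OUTPATH:?}"/*        # empty the output folder; abort if $2 is missing
for src in "$FQPATH"/[^0-9]*.fastq.gz; do
        FILENAME=${src##*/}      # strip the directory part
        dst=${FILENAME#*_}       # strip the leading identifier field
        # bump the S number until the target name is unused
        while [[ -e "$OUTPATH/$dst" ]]; do
                n=${dst#*_S}
                n=$(( ${n%%_*} + 1 ))
                dst=${dst%%S*}S${n}_${dst#*_*_}
        done
        echo "cp -s \"$src\" \"$OUTPATH/$dst\""
        cp -s "$src" "$OUTPATH/$dst"    # symbolic link instead of a full copy
done
echo 'END'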

What I wanted

I have the following filenames in a folder:

A006200089_124769_S1_L001_R1_001.fastq.gz
A006200089_124769_S1_L001_R2_001.fastq.gz
A006200089_124771_S2_L001_R1_001.fastq.gz
A006200089_124771_S2_L001_R2_001.fastq.gz
A006850080_124769_S1_L001_R1_001.fastq.gz
A006850080_124769_S1_L001_R2_001.fastq.gz
A006850080_124769_S1_L002_R1_001.fastq.gz
A006850080_124769_S1_L002_R2_001.fastq.gz
A006850080_124771_S2_L001_R1_001.fastq.gz
A006850080_124771_S2_L001_R2_001.fastq.gz
A006850080_124771_S2_L002_R1_001.fastq.gz
A006850080_124771_S2_L002_R2_001.fastq.gz

They follow this pattern, with the fields separated by _:

identifier_sampleName(integer)_S[1-100]_L[lane]_R[1-3]_001.fastq.gz

In a following step the identifier is removed, so a filename is trimmed to:

124771_S2_L002_R2_001.fastq.gz

The problem is that some of these entries can end up with identical filenames:

A006200089_124769_S1_L001_R1_001.fastq.gz --> 124769_S1_L001_R1_001.fastq.gz
A006850080_124769_S1_L001_R1_001.fastq.gz --> 124769_S1_L001_R1_001.fastq.gz
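
As a minimal sketch of the collision described above, the following lists every trimmed name that more than one original file would map to. The function name find_collisions is hypothetical, and the sketch assumes the identifier is always the first underscore-separated field.

```bash
#!/bin/bash
# Hypothetical helper: print each trimmed name that two or more inputs share.
find_collisions() {
    local name
    for name in "$@"; do
        printf '%s\n' "${name#*_}"   # drop the leading identifier field
    done | sort | uniq -d            # keep only names that occur more than once
}

find_collisions \
    A006200089_124769_S1_L001_R1_001.fastq.gz \
    A006850080_124769_S1_L001_R1_001.fastq.gz \
    A006850080_124769_S1_L002_R1_001.fastq.gz
# prints: 124769_S1_L001_R1_001.fastq.gz
```

It only reports the clashes and does not rename anything.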

What I want is

A006200089_124769_S1_L001_R1_001.fastq.gz --> 124769_S1_L001_R1_001.fastq.gz
A006200089_124769_S1_L001_R2_001.fastq.gz --> 124769_S1_L001_R2_001.fastq.gz
A006850080_124769_S1_L001_R1_001.fastq.gz --> 124769_S2_L001_R1_001.fastq.gz
A006850080_124769_S1_L001_R2_001.fastq.gz --> 124769_S2_L001_R2_001.fastq.gz
A006850080_124769_S1_L002_R1_001.fastq.gz --> 124769_S2_L002_R1_001.fastq.gz
A006850080_124769_S1_L002_R2_001.fastq.gz --> 124769_S2_L002_R2_001.fastq.gz

When there are just a few samples I am using the following code:

#!/bin/bash -l

for i in "$1"/A006850080*.fastq.gz
do
    DIR=${i%/*}
    base1=${i##*/}
    NOEXT=${base1%.*}        # drop .gz
    NOEXT1=${NOEXT%.*}       # drop .fastq

    # split the name into its six underscore-separated fields
    IFS=_ read -r A B C D E F <<< "$NOEXT1"

    SNUM=${C:1}              # numeric part of the S field
    NUM=$((SNUM+1))
    mv -- "$DIR/$base1" "$DIR/${A}_${B}_S${NUM}_${D}_${E}_${F}.fastq.gz"
done

NUM=$((SNUM+1)): for this line I counted the occurrences of the A006200089_124769* filename and increased the S[1-100] part by that number (1 in this case).
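
The arithmetic on the S field can be sketched in isolation. The function name bump_sample_index is hypothetical, and the sketch assumes the S field is always the third underscore-separated field.

```bash
#!/bin/bash
# Hypothetical helper: return the name with its S<number> field raised by one.
bump_sample_index() {
    local name=$1 fields
    IFS=_ read -r -a fields <<< "$name"            # split on underscores
    fields[2]="S$(( 10#${fields[2]#S} + 1 ))"      # strip "S", add one (forced base 10)
    local IFS=_
    printf '%s\n' "${fields[*]}"                   # join the fields with "_" again
}

bump_sample_index A006850080_124769_S1_L001_R1_001.fastq.gz
# prints: A006850080_124769_S2_L001_R1_001.fastq.gz
```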

This code is not enough if

Is there a way to parse all files of the same $sampleName and change the S[1-100] part so that no files will be overwritten?

Thank you in advance

Upvotes: 0

Views: 156

Answers (3)

Paul Hodges

Reputation: 15293

One more way to do it...

for f in A*.fastq.gz;             # edited for idempotence
do new=${f#*_};                   # remove the leading field
   i=1;                           # initialize the version counter
   while [[ -e "$new" ]];         # while the new filename already exists
   do printf -v n %03d $((++i));  # increment and format the counter
      new=${new%_*}_$n.fastq.gz;  # and use it in the new filename
   done;                          # will exit when it finds an unused name
   mv -- "$f" "$new";             # and move the file to that name
done  

Upvotes: 2

M. Nejat Aydin

Reputation: 10123

Here is an implementation in plain bash:

$ cat /tmp/rename

#!/bin/bash

cd "$1" || exit

for src in [^0-9]*.fastq.gz; do
    dst=${src#*_}
    while [[ -e $dst ]]; do
        n=${dst#*_S}
        n=$(( ${n%%_*} + 1 ))
        dst=${dst%%S*}S${n}_${dst#*_*_}
    done
    mv  ./"$src" ./"$dst"
done

Test:

$ mkdir /tmp/test
$ cd /tmp/test
$ touch A00620008{0,9}_124769_S1_L00{1,2}_R{1,2}_001.fastq.gz
$ ls -1
A006200080_124769_S1_L001_R1_001.fastq.gz
A006200080_124769_S1_L001_R2_001.fastq.gz
A006200080_124769_S1_L002_R1_001.fastq.gz
A006200080_124769_S1_L002_R2_001.fastq.gz
A006200089_124769_S1_L001_R1_001.fastq.gz
A006200089_124769_S1_L001_R2_001.fastq.gz
A006200089_124769_S1_L002_R1_001.fastq.gz
A006200089_124769_S1_L002_R2_001.fastq.gz
$ /tmp/rename /tmp/test
$ ls -1
124769_S1_L001_R1_001.fastq.gz
124769_S1_L001_R2_001.fastq.gz
124769_S1_L002_R1_001.fastq.gz
124769_S1_L002_R2_001.fastq.gz
124769_S2_L001_R1_001.fastq.gz
124769_S2_L001_R2_001.fastq.gz
124769_S2_L002_R1_001.fastq.gz
124769_S2_L002_R2_001.fastq.gz
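
The three parameter expansions inside the while loop of the script above can be traced on a single name. This is only a sketch: next_candidate is a hypothetical wrapper around one loop iteration, and it relies on the sample name being purely numeric, so that the first S in the trimmed name is the one that starts the S field.

```bash
#!/bin/bash
# Hypothetical function: one iteration of the loop body, applied to one name.
next_candidate() {
    local dst=$1 n
    n=${dst#*_S}             # text after the first "_S", e.g. 1_L001_R1_001.fastq.gz
    n=$(( ${n%%_*} + 1 ))    # keep what precedes the next "_" and add one
    # prefix up to the first "S" + new S field + everything after the second "_"
    printf '%s\n' "${dst%%S*}S${n}_${dst#*_*_}"
}

next_candidate 124769_S1_L001_R1_001.fastq.gz
# prints: 124769_S2_L001_R1_001.fastq.gz
```

The real script repeats this step until the candidate name does not exist yet.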

Upvotes: 1

francois P

Reputation: 392

You might work around it like this:

#!/bin/bash
# add -xe to the shebang line for debugging

# adapt as needed: this is not a full solution, only a minimal workaround

count=0
set -- *.gz
while (($#)); do
        mv -- "$1" "$(echo "$1" | sed 's/^[A-Z][0-9]\{9\}_//;s/L.../L'"$(printf "%03d" "$count")"'/')"
        shift
        count=$(( count + 1 ))
done

You should at least add a check that the target file does not already exist, or better, proper error handling, just before running the mv command.
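
One possible shape for such a check, as a sketch only: safe_mv is a hypothetical wrapper name, and skipping with a message on standard error is just one way to handle the clash.

```bash
#!/bin/bash
# Hypothetical wrapper: rename only when the target name is still free.
safe_mv() {
    local src=$1 dst=$2
    if [[ -e $dst ]]; then
        printf 'skipping %s: %s already exists\n' "$src" "$dst" >&2
        return 1
    fi
    mv -- "$src" "$dst"
}
```

In the loop above, the mv call would be replaced by safe_mv with the same two arguments. The check and the move are two separate steps, so this does not protect against another process creating the target in between.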

How it works: sed first removes the unneeded leading part of the name (a capital letter followed by nine digits and an underscore), then replaces L and the three characters after it with L and a zero-padded counter.

The counter increases with each file; formatting it is necessary to avoid getting a1.txt instead of a001.txt.
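
To see the two substitutions and the zero-padding on a single name, here is a small sketch. The function name counter_name is hypothetical; the sed expressions are the ones from the script, with the padded counter prepared beforehand by printf -v.

```bash
#!/bin/bash
# Hypothetical helper: apply the same two sed substitutions to one name.
counter_name() {
    local name=$1 count=$2 lane
    printf -v lane 'L%03d' "$count"                       # 7 becomes L007
    sed "s/^[A-Z][0-9]\{9\}_//; s/L.../$lane/" <<< "$name"
}

counter_name A006200089_124769_S1_L001_R1_001.fastq.gz 7
# prints: 124769_S1_L007_R1_001.fastq.gz
```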

Of course this is not a full solution; you have to adapt it to your needs.

# ls
A006200089_124769_S1_L001_R1_001.fastq.gz  A006200089_124771_S2_L001_R2_001.fastq.gz  A006850080_124769_S1_L002_R1_001.fastq.gz  A006850080_124771_S2_L001_R2_001.fastq.gz  test.sh
A006200089_124769_S1_L001_R2_001.fastq.gz  A006850080_124769_S1_L001_R1_001.fastq.gz  A006850080_124769_S1_L002_R2_001.fastq.gz  A006850080_124771_S2_L002_R1_001.fastq.gz
A006200089_124771_S2_L001_R1_001.fastq.gz  A006850080_124769_S1_L001_R2_001.fastq.gz  A006850080_124771_S2_L001_R1_001.fastq.gz  A006850080_124771_S2_L002_R2_001.fastq.gz
# ./test.sh 
# ls -lrth 
total 4.0K
-rwxr-xr-x 1 root root 257 Jul 24 19:02 test.sh
-rw-r--r-- 1 root root   0 Jul 24 19:17 124771_S2_L003_R2_001.fastq.gz
-rw-r--r-- 1 root root   0 Jul 24 19:17 124771_S2_L002_R1_001.fastq.gz
-rw-r--r-- 1 root root   0 Jul 24 19:17 124769_S1_L004_R1_001.fastq.gz
-rw-r--r-- 1 root root   0 Jul 24 19:17 124769_S1_L001_R2_001.fastq.gz
-rw-r--r-- 1 root root   0 Jul 24 19:17 124769_S1_L000_R1_001.fastq.gz
-rw-r--r-- 1 root root   0 Jul 24 19:17 124771_S2_L009_R2_001.fastq.gz
-rw-r--r-- 1 root root   0 Jul 24 19:17 124771_S2_L008_R1_001.fastq.gz
-rw-r--r-- 1 root root   0 Jul 24 19:17 124769_S1_L007_R2_001.fastq.gz
-rw-r--r-- 1 root root   0 Jul 24 19:17 124769_S1_L006_R1_001.fastq.gz
-rw-r--r-- 1 root root   0 Jul 24 19:17 124769_S1_L005_R2_001.fastq.gz
-rw-r--r-- 1 root root   0 Jul 24 19:17 124771_S2_L011_R2_001.fastq.gz
-rw-r--r-- 1 root root   0 Jul 24 19:17 124771_S2_L010_R1_001.fastq.gz

Also consider using several variables instead of one long sub-command in the sed call used here.

Upvotes: 1
