Reputation: 61
The answer that worked for me and provided the most flexibility is by @M.NejatAydin:
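#!/bin/bash
# Usage: script FQPATH OUTPATH
FQPATH=$1
OUTPATH=$2
# ${OUTPATH:?} aborts if OUTPATH is empty, so this can never expand to "rm /*"
rm -- "${OUTPATH:?}"/*
for src in "$FQPATH"/[^0-9]*.fastq.gz; do
    FILENAME=${src##*/}      # strip the directory part
    dst=${FILENAME#*_}       # strip the leading identifier field
    # bump the S number until the name is unused in the output directory
    while [[ -e "$OUTPATH/$dst" ]]; do
        n=${dst#*_S}
        n=$(( ${n%%_*} + 1 ))
        dst=${dst%%S*}S${n}_${dst#*_*_}
    done
    echo "cp -s $src $OUTPATH/$dst"
    cp -s "$src" "$OUTPATH/$dst"
done
echo 'END'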
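To preview the mapping without creating any links, the same renaming rule can be applied to a list of names in memory. This is only a sketch and not part of the original answer: the function name plan_names and the array taken are hypothetical, and it assumes bash 4 or newer and names shaped like the ones in the question below.

```bash
#!/bin/bash
# Hypothetical helper: print "old --> new" for each name given as an argument.
# It uses the same S-number bump as the script above, but tracks used names
# in an associative array instead of checking the file system.
plan_names() {
    local -A taken=()
    local src dst n
    for src in "$@"; do
        dst=${src#*_}                          # drop the leading identifier
        while [[ -n ${taken[$dst]:-} ]]; do    # name already planned?
            n=${dst#*_S}                       # text after "_S"
            n=$(( ${n%%_*} + 1 ))              # S number + 1
            dst=${dst%%S*}S${n}_${dst#*_*_}    # rebuild with the new S number
        done
        taken[$dst]=1
        printf '%s --> %s\n' "$src" "$dst"
    done
}

plan_names A006200089_124769_S1_L001_R1_001.fastq.gz \
           A006850080_124769_S1_L001_R1_001.fastq.gz
```

For the two names above, the second one is mapped to 124769_S2_L001_R1_001.fastq.gz because the S1 name is already taken by the first.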
What I wanted
I have the following filenames in a folder:
A006200089_124769_S1_L001_R1_001.fastq.gz
A006200089_124769_S1_L001_R2_001.fastq.gz
A006200089_124771_S2_L001_R1_001.fastq.gz
A006200089_124771_S2_L001_R2_001.fastq.gz
A006850080_124769_S1_L001_R1_001.fastq.gz
A006850080_124769_S1_L001_R2_001.fastq.gz
A006850080_124769_S1_L002_R1_001.fastq.gz
A006850080_124769_S1_L002_R2_001.fastq.gz
A006850080_124771_S2_L001_R1_001.fastq.gz
A006850080_124771_S2_L001_R2_001.fastq.gz
A006850080_124771_S2_L002_R1_001.fastq.gz
A006850080_124771_S2_L002_R2_001.fastq.gz
Those have the following characteristics (fields are separated by _):
identifier_sampleName(integer)_S[1-100]_R[1-3]_001.fastq.gz
In a following step the $identifier will be deleted, so the filename will be trimmed to:
124771_S2_L002_R2_001.fastq.gz
The problem is that some of those entries can end up with identical filenames:
A006200089_124769_S1_L001_R1_001.fastq.gz --> 124769_S1_L001_R1_001.fastq.gz
A006850080_124769_S1_L001_R1_001.fastq.gz --> 124769_S1_L001_R1_001.fastq.gz
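The collision shown above can be detected before anything is renamed. The following is a small sketch, assuming bash 4 or newer; the function name count_collisions is hypothetical.

```bash
#!/bin/bash
# Hypothetical helper: strip the identifier from each name and report
# every trimmed name that is produced more than once.
count_collisions() {
    local -A seen=()
    local name trimmed
    for name in "$@"; do
        trimmed=${name#*_}                     # remove everything up to the first "_"
        seen[$trimmed]=$(( ${seen[$trimmed]:-0} + 1 ))
    done
    for trimmed in "${!seen[@]}"; do
        if (( ${seen[$trimmed]} > 1 )); then
            printf '%s x%d\n' "$trimmed" "${seen[$trimmed]}"
        fi
    done
}

count_collisions A006200089_124769_S1_L001_R1_001.fastq.gz \
                 A006850080_124769_S1_L001_R1_001.fastq.gz \
                 A006200089_124771_S2_L001_R1_001.fastq.gz
# prints: 124769_S1_L001_R1_001.fastq.gz x2
```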
What I want is
A006200089_124769_S1_L001_R1_001.fastq.gz --> 124769_S1_L001_R1_001.fastq.gz
A006200089_124769_S1_L001_R2_001.fastq.gz --> 124769_S1_L001_R2_001.fastq.gz
A006850080_124769_S1_L001_R1_001.fastq.gz --> 124769_S2_L001_R1_001.fastq.gz
A006850080_124769_S1_L001_R2_001.fastq.gz --> 124769_S2_L001_R2_001.fastq.gz
A006850080_124769_S1_L002_R1_001.fastq.gz --> 124769_S2_L002_R1_001.fastq.gz
A006850080_124769_S1_L002_R2_001.fastq.gz --> 124769_S2_L002_R2_001.fastq.gz
When there are just a few samples I am using the following code:
#!/bin/bash -l
for i in "$1"/A006850080*.fastq.gz
do
    DIR=${i%/*}
    base1=${i##*/}
    NOEXT=${base1%.*}       # drop .gz
    NOEXT1=${NOEXT%.*}      # drop .fastq
    A="$(echo "$NOEXT1" | cut -d'_' -f1)"
    B="$(echo "$NOEXT1" | cut -d'_' -f2)"
    C="$(echo "$NOEXT1" | cut -d'_' -f3)"
    D="$(echo "$NOEXT1" | cut -d'_' -f4)"
    E="$(echo "$NOEXT1" | cut -d'_' -f5)"
    F="$(echo "$NOEXT1" | cut -d'_' -f6)"
    SNUM=${C:1}             # numeric part of the S<n> field
    NUM=$((SNUM+1))
    mv "$DIR/$base1" "$DIR/${A}_${B}_S${NUM}_${D}_${E}_${F}.fastq.gz"
done
NUM=$((SNUM+1)): in this line I have counted the occurrences of the A006200089_124769* filename and increased the S[1-100] part by that number.
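The S number can also be extracted and incremented without calling cut. This is a minimal sketch using only bash built-ins, assuming the third underscore-separated field always has the form S followed by a number; the function name bump_s and its optional second argument are hypothetical.

```bash
#!/bin/bash
# Hypothetical helper: bump_s NAME [STEP] prints NAME with its S number
# increased by STEP (default 1). All fields, including the identifier, are kept.
bump_s() {
    local name=$1 step=${2:-1}
    local IFS=_
    local -a f
    read -r -a f <<< "$name"               # split the name on "_"
    f[2]=S$(( ${f[2]#S} + step ))          # third field: S<n> becomes S<n+step>
    printf '%s\n' "${f[*]}"                # join again with "_" (first char of IFS)
}

bump_s A006850080_124769_S1_L001_R1_001.fastq.gz     # A006850080_124769_S2_L001_R1_001.fastq.gz
bump_s A006850080_124769_S1_L001_R1_001.fastq.gz 3   # A006850080_124769_S4_L001_R1_001.fastq.gz
```

The step still has to be chosen by hand here, which is the limitation described next.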
This code is not enough if more occurrences are present:
A006850069_124769_S1_L001_R1_001.fastq.gz
A006850075_124769_S1_L001_R1_001.fastq.gz
A006200089_124769_S1_L001_R1_001.fastq.gz
A006850080_124769_S1_L001_R1_001.fastq.gz
or if there are more $sampleName values (they could number in the hundreds).
Is there a way to parse all files of the same $sampleName and change the S[1-100] part so that no files will be overwritten?
Thank you in advance
Upvotes: 0
Views: 156
Reputation: 15293
One more way to do it...
for f in A*.fastq.gz; # edited for idempotence
do new=${f#*_}; # remove the leading field
i=1; # initialize the version counter
while [[ -e "$new" ]]; # while the new filename already exists
do printf -v n %03d $((++i)); # increment and format the counter
new=${new%_*}_$n.fastq.gz; # and use it in the new filename
done; # will exit when it finds an unused name
mv -- "$f" "$new"; # and move the file to that name (quoted, in case of odd characters)
done
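To see what the two expansions in the loop above do, they can be applied to a single name without touching any files. Note that this approach increments the trailing 001 field rather than the S number. A minimal sketch, with the sample name taken from the question:

```bash
#!/bin/bash
f=A006850080_124769_S1_L001_R1_001.fastq.gz
new=${f#*_}                      # shortest prefix up to the first "_" is removed
echo "$new"                      # 124769_S1_L001_R1_001.fastq.gz

i=1
printf -v n %03d $((++i))        # n=002
new=${new%_*}_$n.fastq.gz        # suffix from the last "_" is replaced
echo "$new"                      # 124769_S1_L001_R1_002.fastq.gz
```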
Upvotes: 2
Reputation: 10123
Here is an implementation in plain bash:
$ cat /tmp/rename
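#!/bin/bash
cd "$1" || exit
for src in [^0-9]*.fastq.gz; do          # only names that still carry an identifier
    dst=${src#*_}                        # drop the identifier field
    while [[ -e $dst ]]; do              # target taken: bump the S number
        n=${dst#*_S}                     # text after "_S"
        n=$(( ${n%%_*} + 1 ))            # S number + 1
        dst=${dst%%S*}S${n}_${dst#*_*_}  # sample name + new S number + remaining fields
    done
    mv ./"$src" ./"$dst"
done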
Test:
$ mkdir /tmp/test
$ cd /tmp/test
$ touch A00620008{0,9}_124769_S1_L00{1,2}_R{1,2}_001.fastq.gz
$ ls -1
A006200080_124769_S1_L001_R1_001.fastq.gz
A006200080_124769_S1_L001_R2_001.fastq.gz
A006200080_124769_S1_L002_R1_001.fastq.gz
A006200080_124769_S1_L002_R2_001.fastq.gz
A006200089_124769_S1_L001_R1_001.fastq.gz
A006200089_124769_S1_L001_R2_001.fastq.gz
A006200089_124769_S1_L002_R1_001.fastq.gz
A006200089_124769_S1_L002_R2_001.fastq.gz
$ /tmp/rename /tmp/test
$ ls -1
124769_S1_L001_R1_001.fastq.gz
124769_S1_L001_R2_001.fastq.gz
124769_S1_L002_R1_001.fastq.gz
124769_S1_L002_R2_001.fastq.gz
124769_S2_L001_R1_001.fastq.gz
124769_S2_L001_R2_001.fastq.gz
124769_S2_L002_R1_001.fastq.gz
124769_S2_L002_R2_001.fastq.gz
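The expansion ${dst%%S*} cuts at the first capital S in the name, so the script relies on the sample name not containing one. That holds for the integer sample names in the question. If it might not hold, the same bump can be sketched with a regular expression instead. The function name next_name is hypothetical, and the second sample name below is made up for illustration.

```bash
#!/bin/bash
# Hypothetical helper: print NAME (identifier already removed) with the
# S<number> field, i.e. the second underscore-separated field, increased by one.
next_name() {
    local re='^([^_]+)_S([0-9]+)_(.*)$'
    if [[ $1 =~ $re ]]; then
        printf '%s_S%d_%s\n' "${BASH_REMATCH[1]}" \
            "$(( 10#${BASH_REMATCH[2]} + 1 ))" "${BASH_REMATCH[3]}"
    else
        return 1                         # name does not have the expected shape
    fi
}

next_name 124769_S1_L001_R1_001.fastq.gz    # 124769_S2_L001_R1_001.fastq.gz
next_name XSY_S9_L001_R1_001.fastq.gz       # XSY_S10_L001_R1_001.fastq.gz
```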
Upvotes: 1
Reputation: 392
You might work around that:
#!/bin/bash
# add -xe to the shebang line for debugging
# Adapt as needed: this is not the solution, only a minimal workaround.
count=0
set -- *.gz
while (($#)); do
    mv -- "$1" "$(echo "$1" | sed 's/^[A-Z][0-9]\{9\}_//;s/L.../L'"$(printf '%03d' "$count")"'/')"
    shift
    count=$(( count + 1 ))
done
You should at least add a check for whether the target file already exists, or better, proper error handling, just before running the mv command.
How it works: sed first removes the unneeded leading part of the name (a capital letter, nine digits, and an underscore), then replaces L followed by three characters with L and a zero-padded counter.
The counter increases with each file. Padding it is necessary to avoid getting a1.txt instead of a001.txt.
Of course this is not a full solution; you have to adapt it to your needs.
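The two sed substitutions can be tried on a single name before running the loop. A minimal sketch; the counter value 7 is arbitrary and the name is one of the examples from the question.

```bash
#!/bin/bash
name=A006850080_124769_S1_L002_R1_001.fastq.gz
count=7                                   # arbitrary value for illustration
padded=$(printf '%03d' "$count")          # 007
# 1st expression: drop "<capital letter><9 digits>_" at the start
# 2nd expression: replace the first "L" plus three characters with "L<padded>"
result=$(printf '%s\n' "$name" | sed "s/^[A-Z][0-9]\{9\}_//;s/L.../L${padded}/")
echo "$result"                            # 124769_S1_L007_R1_001.fastq.gz
```

As the result shows, the original lane number (L002) is overwritten by the counter.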
# ls
A006200089_124769_S1_L001_R1_001.fastq.gz A006200089_124771_S2_L001_R2_001.fastq.gz A006850080_124769_S1_L002_R1_001.fastq.gz A006850080_124771_S2_L001_R2_001.fastq.gz test.sh
A006200089_124769_S1_L001_R2_001.fastq.gz A006850080_124769_S1_L001_R1_001.fastq.gz A006850080_124769_S1_L002_R2_001.fastq.gz A006850080_124771_S2_L002_R1_001.fastq.gz
A006200089_124771_S2_L001_R1_001.fastq.gz A006850080_124769_S1_L001_R2_001.fastq.gz A006850080_124771_S2_L001_R1_001.fastq.gz A006850080_124771_S2_L002_R2_001.fastq.gz
# ./test.sh
# ls -lrth
total 4.0K
-rwxr-xr-x 1 root root 257 Jul 24 19:02 test.sh
-rw-r--r-- 1 root root 0 Jul 24 19:17 124771_S2_L003_R2_001.fastq.gz
-rw-r--r-- 1 root root 0 Jul 24 19:17 124771_S2_L002_R1_001.fastq.gz
-rw-r--r-- 1 root root 0 Jul 24 19:17 124769_S1_L004_R1_001.fastq.gz
-rw-r--r-- 1 root root 0 Jul 24 19:17 124769_S1_L001_R2_001.fastq.gz
-rw-r--r-- 1 root root 0 Jul 24 19:17 124769_S1_L000_R1_001.fastq.gz
-rw-r--r-- 1 root root 0 Jul 24 19:17 124771_S2_L009_R2_001.fastq.gz
-rw-r--r-- 1 root root 0 Jul 24 19:17 124771_S2_L008_R1_001.fastq.gz
-rw-r--r-- 1 root root 0 Jul 24 19:17 124769_S1_L007_R2_001.fastq.gz
-rw-r--r-- 1 root root 0 Jul 24 19:17 124769_S1_L006_R1_001.fastq.gz
-rw-r--r-- 1 root root 0 Jul 24 19:17 124769_S1_L005_R2_001.fastq.gz
-rw-r--r-- 1 root root 0 Jul 24 19:17 124771_S2_L011_R2_001.fastq.gz
-rw-r--r-- 1 root root 0 Jul 24 19:17 124771_S2_L010_R1_001.fastq.gz
Also consider using several variables instead of one long sub-command inside the sed call used here.
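One way to follow both suggestions (splitting the long command into variables, and checking the target before moving) is sketched below. It is an illustration rather than a drop-in replacement: the function name rename_with_counter is hypothetical, and skipping existing targets with a message on stderr is just one possible policy.

```bash
#!/bin/bash
# Hypothetical helper: apply the same two substitutions as the script above
# to every *.gz file in the current directory, one step per variable.
rename_with_counter() {
    local count=0 src padded stripped dst
    for src in *.gz; do
        [[ -e $src ]] || continue                  # glob matched nothing
        padded=$(printf '%03d' "$count")           # zero-padded counter
        stripped=$(printf '%s\n' "$src" | sed 's/^[A-Z][0-9]\{9\}_//')
        dst=$(printf '%s\n' "$stripped" | sed "s/L.../L${padded}/")
        if [[ -e $dst ]]; then
            printf 'skipping %s: %s already exists\n' "$src" "$dst" >&2
        else
            mv -- "$src" "$dst"
        fi
        count=$(( count + 1 ))
    done
}
```

Call rename_with_counter from inside the directory that holds the files, ideally on a copy first.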
Upvotes: 1