I have a directory containing 60k subdirectories.
Each subdirectory has two sub-subdirectories (2_4 and cov_bound). I am only interested in 2_4, so the relevant path looks like this:
main_directory/foo/2_4/
Each subdirectory foo contains one .pdb file.
Each 2_4 sub-subdirectory contains zero or more .txt files.
For example:
main_directory/foo/1A2C.pdb
main_directory/foo/2_4/XLS#A#207.txt
main_directory/foo/2_4/XLS#B#209.txt
main_directory/foo/2_4/XLS#C#207.txt
main_directory/foo/2_4/SOS#D#145.txt
I am trying to join the letters before the first # in each .txt filename (XLS and SOS in this example) into the file name of the pdb file:
1A2C_XLS_SOS.pdb
Multiple files start with XLS#, but each prefix should only be used once.
The second problem I ran into is that if the subdirectory 2_4 is empty, the output is 1A2C_.pdb, and I want to get rid of this. So if 2_4 is empty, it should not be processed; only the 2_4 subdirectories that contain .txt files should be.
I tried writing something in bash, but it only works when 2_4 contains a single .txt file, and it also processes empty 2_4 directories.
All these pdb files with new filenames should be copied into another directory.
I have tried this in bash:
mkdir pdb_files
for i in */ ; do cd $i ; pwd ; a=`head -n1 2_4/*txt | awk '{print $4}' ` ; j=`ls *pdb` ; cp $j ../pdb_files/${j//.pdb/_$a}.pdb ; cd ../ ; done
I was running it from main_directory.
Upvotes: 1
Here's a crack at it:
TARGETDIR=./pdb_files
for dir in $(find . -maxdepth 1 -type d -not -name .)
do
    # Unique prefixes (text before the first '#') of the .txt files in 2_4
    PREFIXES=( $(
        for file in "${dir}"/2_4/*.txt
        do
            filename=$(basename "$file")
            echo "${filename%%#*.txt}"
        done | sort -u
    ) )
    # If the glob matched nothing, the only "prefix" is the literal '*.txt'
    if [ "${PREFIXES[0]}" != '*.txt' ]
    then
        for oldpdb in "${dir}"/*.pdb
        do
            pdbname=${oldpdb%%.pdb}
            # Join the prefixes with '_'
            pdbsuffix=$(IFS=_ ; echo "${PREFIXES[*]}")
            newpdb=${TARGETDIR}/$(basename "$pdbname")_${pdbsuffix}.pdb
            echo -------------------------
            echo "Directory: $dir"
            echo "Old file name: $oldpdb"
            echo "New file name: $newpdb"
            # I think this is what you want?
            cp "$oldpdb" "$newpdb"
        done
    else
        for oldpdb in "${dir}"/*.pdb
        do
            echo -------------------------
            echo "Directory: $dir"
            echo "Old file name: $oldpdb"
            echo "New file name: do not rename file"
            # maybe you want to copy unmodified files?
            # cp "$oldpdb" "$TARGETDIR"
        done
    fi
done
This is my directory structure, from main_directory:
main_directory/
├── foo
│   ├── 1A2C.pdb
│   └── 2_4
│       ├── SOS#D#145.txt
│       ├── XLS#A#207.txt
│       ├── XLS#B#209.txt
│       └── XLS#C#207.txt
├── foo2
│   ├── 1A2B.pdb
│   └── 2_4
└── run
And, the output from './run' in main_directory:
-------------------------
Directory: ./foo2
Old file name:
New file name: do not rename file
-------------------------
Directory: ./foo
Old file name: ./foo/1A2C.pdb
New file name: ./foo/1A2C_SOS_XLS.pdb
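The core of the script above is building the suffix: take the text before the first # of each .txt name, keep each prefix once, and join the results with _. A minimal standalone sketch of just that step follows; the function name join_prefixes is made up for illustration and is not part of the script above, and it uses paste for the join rather than the IFS trick.

```shell
#!/bin/bash
# join_prefixes (hypothetical helper): print the unique text before the
# first '#' of each given file name, sorted and joined with '_'.
# Prints nothing when called without arguments.
join_prefixes() {
    local name
    local -a prefixes=()
    for name in "$@"; do
        name=${name##*/}            # strip any leading directory part
        prefixes+=("${name%%#*}")   # keep only the text before the first '#'
    done
    [ "${#prefixes[@]}" -eq 0 ] && return 0
    # sort -u drops duplicates; paste -s joins the lines with '_'
    printf '%s\n' "${prefixes[@]}" | sort -u | paste -sd_ -
}

# Example with the file names from the question; prints SOS_XLS
join_prefixes 'XLS#A#207.txt' 'XLS#B#209.txt' 'XLS#C#207.txt' 'SOS#D#145.txt'
```

Because of sort -u the prefixes come out in sorted order (SOS before XLS), not in the order the files were listed.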
EDIT: I had missed the "copy these new files to some other directory" bit, so I tweaked the script a bit.
EDIT 2: Kudos to melpomene; sorry to steal your question, by the time I got here you had cleaned it up well enough that it made sense. Sorry :(
EDIT 3: Well, it isn't super clear exactly what you want to do with these files. The hard part seemed to be getting the correct new file name, and I think this code does that. I've changed it so it just prints out the old and new PDB filenames; from there you can insert your own logic to rename or copy them around as you see fit. This script doesn't cd into each directory, it does everything from the main directory. Please give it a try with a subset of your data and let me know how it goes.
EDIT 4: Changed logic on where we move the new file
EDIT 5: I had put some problematic code here. Here's the code without all of the debug output that was meant to demonstrate the inner workings of the script. It will do what you're after without reporting what it's doing along the way.
EDIT 6: My previous code didn't work for directories that had one .txt file. This will work for all use cases I'm aware of.
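#!/bin/bash
TARGETDIR=./pdb_files
mkdir -p "$TARGETDIR"   # make sure the destination exists
for dir in $(find . -maxdepth 1 -type d -not -name .)
do
    # Unique prefixes (text before the first '#') of the .txt files in 2_4
    PREFIXES=( $(
        for file in "${dir}"/2_4/*.txt
        do
            filename=$(basename "$file")
            echo "${filename%%#*.txt}"
        done | sort -u
    ) )
    # Skip directories where the glob matched nothing and stayed literal
    if [ "${#PREFIXES[@]}" -ge 1 ] && [ "${PREFIXES[0]}" != '*.txt' ]
    then
        for oldpdb in "${dir}"/*.pdb
        do
            pdbname=${oldpdb%%.pdb}
            # Join the prefixes with '_'
            pdbsuffix=$(IFS=_ ; echo "${PREFIXES[*]}")
            newpdb=${TARGETDIR}/$(basename "$pdbname")_${pdbsuffix}.pdb
            cp "$oldpdb" "$newpdb"
        done
    fi
done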
Tree before:
main_directory
├── foo
│   ├── 1A2C.pdb
│   └── 2_4
│       ├── SOS#D#145.txt
│       ├── XLS#A#207.txt
│       ├── XLS#B#209.txt
│       └── XLS#C#207.txt
├── foo2
│   ├── 1A2B.pdb
│   └── 2_4
├── foo3
│   ├── 2_4
│   │   └── XLS#C#100.csv
│   └── 2A3B.pdb
├── foo4
│   ├── 2_4
│   │   └── XLS#D#201.txt
│   └── 3A3B.pdb
├── pdb_files
└── run
After:
main_directory
├── foo
│   ├── 1A2C.pdb
│   └── 2_4
│       ├── SOS#D#145.txt
│       ├── XLS#A#207.txt
│       ├── XLS#B#209.txt
│       └── XLS#C#207.txt
├── foo2
│   ├── 1A2B.pdb
│   └── 2_4
├── foo3
│   ├── 2_4
│   │   └── XLS#C#100.csv
│   └── 2A3B.pdb
├── foo4
│   ├── 2_4
│   │   └── XLS#D#201.txt
│   └── 3A3B.pdb
├── pdb_files
│   ├── 1A2C_SOS_XLS.pdb
│   └── 3A3B_XLS.pdb
└── run
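One possible variation, offered as a sketch and not as the script above: with bash's nullglob option an unmatched 2_4/*.txt glob expands to nothing, so an empty 2_4 can be detected by an empty array instead of comparing against a literal '*.txt', and quoting every expansion keeps directory names with spaces intact. The function name copy_renamed_pdbs and its two arguments (main directory, target directory) are made up for this example.

```shell
#!/bin/bash
# copy_renamed_pdbs MAIN_DIR TARGET_DIR  (hypothetical helper)
# For every MAIN_DIR/*/ whose 2_4 subdirectory holds at least one .txt file,
# copy its .pdb files to TARGET_DIR with the unique '#'-prefixes appended.
# The body runs in a subshell, so nullglob does not leak to the caller.
copy_renamed_pdbs() (
    shopt -s nullglob                  # unmatched globs expand to nothing
    main=$1
    target=$2
    mkdir -p "$target"
    for dir in "$main"/*/; do
        dir=${dir%/}
        txts=("$dir"/2_4/*.txt)
        # Empty or missing 2_4: nothing to append, so skip this directory
        [ "${#txts[@]}" -eq 0 ] && continue
        suffix=$(
            for txt in "${txts[@]}"; do
                name=${txt##*/}                  # file name without path
                printf '%s\n' "${name%%#*}"      # text before the first '#'
            done | sort -u | paste -sd_ -
        )
        for pdb in "$dir"/*.pdb; do
            name=${pdb##*/}
            cp -- "$pdb" "$target/${name%.pdb}_${suffix}.pdb"
        done
    done
)
```

Run from main_directory as copy_renamed_pdbs . ./pdb_files, this should produce the same two files as in the tree above. A target directory inside the main directory is skipped by the loop because it has no 2_4 subdirectory.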
Upvotes: 1