sergio

Reputation: 31

Extracting prefixes from .txt filenames in a subdirectory and appending them to the name of a file in the parent directory

I have a directory containing 60k subdirectories. Each subdirectory has two subdirectories of its own (2_4 and cov_bound), and I am only interested in 2_4. The layout looks like this:

main_directory/foo/2_4/

Each subdirectory foo contains one .pdb file.
Each 2_4 subsubdirectory contains 0 or more .txt files.

So it looks like this:

main_directory/foo/1A2C.pdb
main_directory/foo/2_4/XLS#A#207.txt
main_directory/foo/2_4/XLS#B#209.txt
main_directory/foo/2_4/XLS#C#207.txt
main_directory/foo/2_4/SOS#D#145.txt

I am trying to join the letters before the first # in the filename (XLS, SOS in this example) into the file name of the pdb file:

1A2C_XLS_SOS.pdb

Multiple files start with XLS#, but each prefix should only be used once.

The second problem is that when 2_4 is empty, my attempt produces 1A2C_.pdb, which I don't want. Directories whose 2_4 is empty should be skipped; only those whose 2_4 contains .txt files should be processed.

I tried to write this in bash, but my attempt only works when 2_4 contains a single .txt file, and it also processes empty 2_4 directories.

All these pdb files with new filenames should be copied into another directory.

I have tried this in bash:

mkdir pdb_files
for i in */ ; do
  cd $i
  pwd
  a=`head -n1 2_4/*txt | awk '{print $4}'`
  j=`ls *pdb`
  cp $j ../pdb_files/${j//.pdb/_$a}.pdb
  cd ../
done

I was running it from the main_directory.

Upvotes: 1

Views: 86

Answers (1)

Jeff Breadner

Reputation: 1448

Here's a crack at it:

TARGETDIR=./pdb_files
for dir in $(find . -maxdepth 1 -type d -not -name .)
do
  PREFIXES=( $(
    for file in ${dir}/2_4/*.txt
    do
      filename=$(basename $file)
      echo ${filename%%\#*.txt}
    done | sort -u 
  ) )

  if [ "${PREFIXES[0]}" != '*.txt' ]
  then
    for oldpdb in ${dir}/*.pdb
    do
      pdbname=${oldpdb%%.pdb}
      pdbsuffix=$(IFS=_ ; echo "${PREFIXES[*]}")
      newpdb=${TARGETDIR}/$(basename $pdbname)_${pdbsuffix}.pdb
      echo -------------------------
      echo Directory: $dir
      echo Old file name: $oldpdb
      echo New file name: $newpdb
      # I think this is what you want?
      cp $oldpdb $newpdb
    done
  else
    for oldpdb in ${dir}/*.pdb
    do
      echo -------------------------
      echo Directory: $dir
      echo Old file name: $oldpdb
      echo New file name: do not rename file
      # maybe you want to copy unmodified files?
      # cp $oldpdb $TARGETDIR
    done
  fi
done
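The prefix step in the loop above relies on two things: the `%%` parameter expansion, which removes the longest suffix matching a pattern, and `sort -u`, which collapses duplicates. Here is a minimal sketch of just that step. The helper name `unique_prefixes` is made up for illustration and is not part of the script; it takes file names as arguments.

```shell
#!/bin/bash
# unique_prefixes is a hypothetical helper, for illustration only.
# It prints the part of each file name before the first '#',
# one per line, sorted and de-duplicated.
unique_prefixes() {
  local path name
  for path in "$@"; do
    name=${path##*/}               # drop any leading directory components
    printf '%s\n' "${name%%#*}"    # drop everything from the first '#' onwards
  done | sort -u
}

unique_prefixes 'foo/2_4/XLS#A#207.txt' 'foo/2_4/XLS#B#209.txt' 'foo/2_4/SOS#D#145.txt'
```

For the three names above this should print SOS and XLS on separate lines. A name without any `#` passes through unchanged.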

This is my directory structure, from main_directory:

main_directory/
├── foo
│   ├── 1A2C.pdb
│   └── 2_4
│       ├── SOS#D#145.txt
│       ├── XLS#A#207.txt
│       ├── XLS#B#209.txt
│       └── XLS#C#207.txt
├── foo2
│   ├── 1A2B.pdb
│   └── 2_4
└── run

And, the output from './run' in main_directory:

-------------------------
Directory: ./foo2
Old file name:
New file name: do not rename file
-------------------------
Directory: ./foo
Old file name: ./foo/1A2C.pdb
New file name: ./foo/1A2C_SOS_XLS.pdb

EDIT: I had missed the "copy these new files to some other directory" bit, so I tweaked the script a bit.

EDIT 2: Kudos to melpomene; sorry to steal your question, by the time I got here you had cleaned it up well enough that it made sense. Sorry :(

EDIT 3: Well, it isn't super clear exactly what you want to do with these files. The hard part seemed to be getting the correct new file name, and I think this code does that. I've changed it so it just prints out the old and new PDB filenames; from there you can insert your own logic to rename or copy them as you see fit. This script doesn't cd into each directory; it does everything from the main directory. Please give it a try with a subset of your data and let me know how it goes.
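To show how the new file name is put together: inside double quotes, `"$*"` (like `"${array[*]}"`) expands to a single word whose parts are joined by the first character of `IFS`, so setting `IFS=_` gives the `SOS_XLS` suffix. This is only a sketch; the helper `build_pdb_name` and its argument order are assumptions made for the example, not something the script defines.

```shell
#!/bin/bash
# build_pdb_name is a hypothetical helper, for illustration only.
# Usage: build_pdb_name TARGETDIR PDBFILE PREFIX...
build_pdb_name() {
  local targetdir=$1 pdb=$2
  shift 2                     # the remaining arguments are the prefixes
  local base=${pdb##*/}       # e.g. 1A2C.pdb
  base=${base%.pdb}           # e.g. 1A2C
  local IFS=_                 # local to the function, so the caller's IFS is untouched
  printf '%s/%s_%s.pdb\n' "$targetdir" "$base" "$*"
}

build_pdb_name ./pdb_files ./foo/1A2C.pdb SOS XLS
```

With the arguments shown, this should print ./pdb_files/1A2C_SOS_XLS.pdb.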

EDIT 4: Changed logic on where we move the new file

EDIT 5: I had put some problematic code here. Here's the code without the debug output that was meant to demonstrate the inner workings of the script. It does what you're after without reporting what it's doing along the way.

EDIT 6: My previous code didn't work for directories that had one .txt file. This will work for all use cases I'm aware of.

#!/bin/bash

TARGETDIR=./pdb_files

for dir in $(find . -maxdepth 1 -type d -not -name .)
do
  PREFIXES=( $(
    for file in ${dir}/2_4/*.txt
    do
      filename=$(basename $file)
      echo ${filename%%\#*.txt}
    done | sort -u
  ) )

  if [ ${#PREFIXES[@]} -ge 1 -a "${PREFIXES[0]}" != '*.txt' ]
  then
    for oldpdb in ${dir}/*.pdb
    do
      pdbname=${oldpdb%%.pdb}
      pdbsuffix=$(IFS=_ ; echo "${PREFIXES[*]}")
      newpdb=${TARGETDIR}/$(basename $pdbname)_${pdbsuffix}.pdb
      cp $oldpdb $newpdb
    done
  fi
done
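The comparison against '*.txt' is needed because, by default, bash leaves a glob that matches nothing as the literal pattern. An alternative is `shopt -s nullglob`, which makes unmatched globs expand to nothing, so the array length alone tells you whether 2_4 has any .txt files. Below is a sketch of that variant, with variables quoted so names containing spaces survive. The function name `copy_renamed_pdbs` and its two arguments are assumptions for the example, and I have not run it against anything like 60k directories.

```shell
#!/bin/bash
# copy_renamed_pdbs is a hypothetical function, for illustration only.
# Usage: copy_renamed_pdbs MAINDIR TARGETDIR
copy_renamed_pdbs() {
  local maindir=$1 targetdir=$2
  local dir txt pdb base suffix line
  local -a txts prefixes
  mkdir -p "$targetdir"
  (
    # Subshell so that nullglob does not leak into the caller.
    shopt -s nullglob                    # unmatched globs expand to nothing
    for dir in "$maindir"/*/; do
      dir=${dir%/}
      txts=( "$dir"/2_4/*.txt )
      (( ${#txts[@]} )) || continue      # empty or missing 2_4: skip this directory
      # Collect the unique prefixes (text before the first '#').
      prefixes=()
      while IFS= read -r line; do
        prefixes+=( "$line" )
      done < <(
        for txt in "${txts[@]}"; do
          txt=${txt##*/}
          printf '%s\n' "${txt%%#*}"
        done | sort -u
      )
      suffix=$(IFS=_; printf '%s' "${prefixes[*]}")
      for pdb in "$dir"/*.pdb; do
        base=${pdb##*/}
        cp -- "$pdb" "$targetdir/${base%.pdb}_${suffix}.pdb"
      done
    done
  )
}
```

From main_directory it would be called as `copy_renamed_pdbs . ./pdb_files`. The target directory itself is visited by the loop but skipped, since it has no 2_4 with .txt files.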

Tree before:

main_directory
├── foo
│   ├── 1A2C.pdb
│   └── 2_4
│       ├── SOS#D#145.txt
│       ├── XLS#A#207.txt
│       ├── XLS#B#209.txt
│       └── XLS#C#207.txt
├── foo2
│   ├── 1A2B.pdb
│   └── 2_4
├── foo3
│   ├── 2_4
│   │   └── XLS#C#100.csv
│   └── 2A3B.pdb
├── foo4
│   ├── 2_4
│   │   └── XLS#D#201.txt
│   └── 3A3B.pdb
├── pdb_files
└── run

After:

main_directory
├── foo
│   ├── 1A2C.pdb
│   └── 2_4
│       ├── SOS#D#145.txt
│       ├── XLS#A#207.txt
│       ├── XLS#B#209.txt
│       └── XLS#C#207.txt
├── foo2
│   ├── 1A2B.pdb
│   └── 2_4
├── foo3
│   ├── 2_4
│   │   └── XLS#C#100.csv
│   └── 2A3B.pdb
├── foo4
│   ├── 2_4
│   │   └── XLS#D#201.txt
│   └── 3A3B.pdb
├── pdb_files
│   ├── 1A2C_SOS_XLS.pdb
│   └── 3A3B_XLS.pdb
└── run

Upvotes: 1
