user4532841

Recursively concatenating (joining) and renaming text files in a directory tree

I am using Mac OS X Lion.

I have a folder: LITERATURE with the following structure:

LITERATURE > Y > YATES, DORNFORD > THE BROTHER OF DAPHNE:
  Chapters 01-05.txt
  Chapters 06-10.txt
  Chapters 11-end.txt

I want to recursively concatenate the chapters that are split into multiple files (not all are). Then, I want to write the concatenated file to its parent's parent directory. The name of the concatenated file should be the same as the name of its parent directory.

For example, after running the script (in the folder structure shown above) I should get the following.

LITERATURE > Y > YATES, DORNFORD:
  THE BROTHER OF DAPHNE.txt
  THE BROTHER OF DAPHNE:
    Chapters 01-05.txt
    Chapters 06-10.txt
    Chapters 11-end.txt

In this example, the parent directory is THE BROTHER OF DAPHNE and the parent's parent directory is YATES, DORNFORD.
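
In shell terms, the naming rule I am after could be sketched roughly like this. The function name `target_for` is made up for illustration; it only computes the output path and does not touch any files.

```bash
#!/bin/bash

# Hypothetical helper: given the path to a book directory, print the path of
# the concatenated file, i.e. <parent's parent>/<parent name>.txt
target_for() {
    local dir=${1%/}    # drop a trailing slash, if any
    printf '%s/%s.txt\n' "$(dirname "$dir")" "$(basename "$dir")"
}

target_for "LITERATURE/Y/YATES, DORNFORD/THE BROTHER OF DAPHNE"
# prints: LITERATURE/Y/YATES, DORNFORD/THE BROTHER OF DAPHNE.txt
```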


[Updated March 6th—Rephrased the question/answer so that the question/answer is easy to find and understand.]

Upvotes: 0

Views: 400

Answers (4)

user4532841

Thanks for all your input. It got me thinking, and I managed to concatenate the files using the following steps:


  1. This script replaces spaces in filenames with underscores.

 

#!/bin/bash

# We are going to iterate through the directory tree, up to a maximum depth of 20.
# Going one level deeper on each pass means a directory is renamed before its contents.
for i in $(seq 1 20)
  do

# In UNIX-based systems, files and directories are the same (Everything is a File!).
# The 'find' command lists all files whose names contain spaces. The | (pipe) …
# … forwards the list to a 'while' loop that iterates through each file in the list.
# 'IFS= read -r' keeps leading spaces and backslashes in the names intact.
    find . -maxdepth "$i" -name '* *' | while IFS= read -r file
    do

# Here, we use 'sed' to replace spaces in the filename with underscores.
# The 'echo' prints a message to the console before renaming the file using 'mv'.
      item=$(echo "$file" | sed 's/ /_/g')
      echo "Renaming '$file' to '$item'"
      mv "$file" "$item"
    done
done
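
A possible single-pass variant of the rename step, as a sketch only: it relies on 'find -depth', which lists a directory's contents before the directory itself, so the fixed depth limit of 20 is not needed. The function name rename_spaces is hypothetical, and the sketch assumes no underscore-named twin of a file already exists (otherwise 'mv' would overwrite it or move into it).

```bash
#!/bin/bash

# Hypothetical helper: rename every entry below $1 whose name contains a space.
rename_spaces() {
    local root=$1 path dir base
    # -depth emits children before their parent directory, so the parent's
    # path is still valid at the moment each child is renamed.
    find "$root" -mindepth 1 -depth -name '* *' -print0 |
    while IFS= read -r -d '' path; do
        dir=${path%/*}      # everything up to the last slash
        base=${path##*/}    # the last path component only
        mv -- "$path" "$dir/${base// /_}"
    done
}
```

Calling rename_spaces . from the top of the tree should have the same effect as the loop above, because only the last path component is rewritten on each rename.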

  2. This script concatenates text files that start with Part, Chapter, Section, or Book.

 

#!/bin/bash

# Here, we go through all the directories (up to a depth of 20).
# The previous script removed the spaces from all names, so splitting the …
# … output of 'find' on white space is safe here.
for D in $(find . -maxdepth 20 -type d)
do

# Check if the directory contains any files of interest.
    if ls "$D"/Part*.txt &>/dev/null ||
       ls "$D"/Chapter*.txt &>/dev/null ||
       ls "$D"/Section*.txt &>/dev/null ||
       ls "$D"/Book*.txt &>/dev/null
      then

# If we get here, then there are split files in the directory; we will concatenate them.
# First, we trim the full directory path ($D) so that we are left with the path to the …
# … files' parent's parent directory. We will write the concatenated file here. (✝)
        ppdir="$(dirname "$D")"

# Here, we concatenate the files using 'cat'. The 'basename' command extracts the name of …
# … the parent directory from the full directory path ($D) and gives us the filename.
# Finally, we write the concatenated file to its parent's parent directory. (✝)
        cat "$D"/*.txt > "$ppdir/$(basename "$D").txt"
    fi
done
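
If you would rather keep the spaces in the names, the concatenation step might be written along these lines. This is a sketch under assumptions, not what I actually ran: the function name concat_split_files is invented, and unlike the script above it concatenates only the Part/Chapter/Section/Book files rather than every .txt file in the directory.

```bash
#!/bin/bash

# Hypothetical helper: for every directory below $1 that holds split files,
# write <parent's parent>/<parent name>.txt. Names may contain spaces.
concat_split_files() {
    local root=$1 dir f files
    find "$root" -mindepth 1 -type d -print0 |
    while IFS= read -r -d '' dir; do
        files=()
        # Each glob expands in sorted order; unmatched patterns stay literal
        # and are filtered out by the -f test.
        for f in "$dir"/Part*.txt "$dir"/Chapter*.txt \
                 "$dir"/Section*.txt "$dir"/Book*.txt; do
            if [ -f "$f" ]; then
                files+=("$f")
            fi
        done
        # Skip directories without split files.
        [ "${#files[@]}" -gt 0 ] || continue
        cat "${files[@]}" > "$(dirname "$dir")/$(basename "$dir").txt"
    done
}
```

Running concat_split_files LITERATURE would then be the equivalent of the second step.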

  3. Now, we delete all the files that we concatenated so that their parent directories are left empty.

    • find . -name 'Part*' -delete
    • find . -name 'Chapter*' -delete
    • find . -name 'Section*' -delete
    • find . -name 'Book*' -delete

  4. The following command deletes empty directories. (✝) This is why we wrote each concatenated file to its parent's parent directory: once the split files are deleted, the parent directory is empty and can be removed.

    • find . -type d -empty -delete
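
Since -delete cannot be undone, it may be worth previewing what would be removed first. A minimal sketch, assuming the split files all end in .txt; the function name list_split_files is made up, and '-type f' is added so that directories whose names start with Part, Chapter, Section, or Book are left alone.

```bash
#!/bin/bash

# Hypothetical helper: print the split files below $1 (default: current
# directory) without deleting anything.
list_split_files() {
    find "${1:-.}" -type f \( -name 'Part*.txt' -o -name 'Chapter*.txt' \
        -o -name 'Section*.txt' -o -name 'Book*.txt' \) -print
}
```

Once the list looks right, the same expression with -delete in place of -print should remove exactly those files.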


Upvotes: 0

David W.

Reputation: 107040

The shell doesn't handle white space in file names gracefully. However, over the years, Unix has come up with some tricks that'll help:

$ find . -name "Chapters*.txt" -type f -print0 | xargs -0 cat >> final_file.txt

Might do what you want.

The find command recursively finds all of the directory entries in a file tree that match the query (in this case, the type must be a file, and the name must match the pattern Chapters*.txt).

Normally, find separates the directory entry names with newlines (NL), but -print0 tells it to separate them with the NUL character instead. NL is a valid character in a file name, but NUL isn't.

The xargs command takes the output of the find and processes it. xargs gathers all the names and passes them in bulk to the command you give it -- in this case the cat command.

Normally, xargs splits its input on white space, which means Chapters would be treated as one file and 01-05.txt as another. However, -0 tells xargs to use NUL as the separator, which is exactly what -print0 emits.
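
One caveat: find does not promise any particular order, so the chapters could come out shuffled. A sketch that sorts the NUL-separated names first, assuming your sort supports -z (GNU sort does; check the man page on other systems). The function name concat_sorted and its two parameters are made up for illustration.

```bash
#!/bin/bash

# Hypothetical helper: concatenate the Chapters*.txt files below directory $1
# into output file $2, in sorted path order.
concat_sorted() {
    # LC_ALL=C gives a plain byte-wise sort that does not depend on the locale.
    find "$1" -type f -name 'Chapters*.txt' -print0 |
        LC_ALL=C sort -z |
        xargs -0 cat > "$2"
}
```

For one book this would be called as concat_sorted "THE BROTHER OF DAPHNE" final_file.txt. Note that it overwrites the output file, where the >> in the one-liner above appends.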

Upvotes: 0

NeronLeVelu

Reputation: 10039

cat Chapters*.txt > FinaleFile.txt.raw
Chapters="$( ls -1 Chapters*.txt | sed -n 'H;${x;s/\
//g;s/ *Chapters //g;s/\.txt/ /g;s/ *$//p;}' )"
mv FinaleFile.txt.raw "FinaleFile ${Chapters}.txt"

  • Concatenate all the chapter files at once (assuming the sorted file names give the right order).
  • Extract the chapter numbers/references from the ls output of the folder, using sed to adapt the format.
  • Rename the concatenated file so that its name includes the chapters.
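
The label-building step could also be done with plain parameter expansion instead of sed. A sketch only: the function name chapter_label is hypothetical, and it assumes every file is named "Chapters <range>.txt" as in the question.

```bash
#!/bin/bash

# Hypothetical helper: turn "Chapters 01-05.txt" "Chapters 06-10.txt" ...
# into the single label "01-05 06-10 ...".
chapter_label() {
    local label="" f name
    for f in "$@"; do
        name=${f##*/}             # drop any leading directory
        name=${name#Chapters }    # drop the "Chapters " prefix
        name=${name%.txt}         # drop the extension
        label="${label:+$label }$name"
    done
    printf '%s\n' "$label"
}
```

It would be used as mv FinaleFile.txt.raw "FinaleFile $(chapter_label Chapters*.txt).txt".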

Upvotes: 0

tripleee

Reputation: 189447

It's not clear what you mean by "recursively" but this should be enough to get you started.

#!/bin/bash

titlecase () {  # adapted from http://stackoverflow.com/a/6969886/874188
    local arr
    arr=("${@,,}")
    echo "${arr[@]^}"
}

for book in LITERATURE/?/*/*; do
    title=$(titlecase ${book##*/})
    for file in "$book"/*; do
        cat "$file"
        echo
    done >"$book/$title"
    echo '# not doing this:' rm "$book"/*.txt
done

This loops over LITERATURE/initial/author/BOOK TITLE and creates a file Book Title (where should a space be added?) from the catenated files in each book directory. (I would generate it in the parent directory and then remove the book directory completely, assuming it contains nothing of value any longer.) There is no recursion, just a loop over this directory structure.
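
For the variant that writes into the parent directory, a rough sketch might look like this. It keeps the book title exactly as the directory is named (no title-casing) and adds a .txt suffix, which matches the example in the question; the function name join_books is made up.

```bash
#!/bin/bash

# Hypothetical helper: for each $1/<initial>/<author>/<BOOK TITLE>/ directory,
# write <author>/<BOOK TITLE>.txt from the .txt files inside it.
join_books() {
    local root=$1 book file
    # The trailing slash makes the glob match directories only.
    for book in "$root"/?/*/*/; do
        book=${book%/}
        [ -d "$book" ] || continue    # the glob matched nothing
        for file in "$book"/*.txt; do
            [ -f "$file" ] || continue
            cat "$file"
            echo                      # blank line between parts
        done > "${book%/*}/${book##*/}.txt"
    done
}
```

Because the output lands outside the book directory, the inner glob can never pick up the file that is being written.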

Removing the chapter files is a bit risky so I'm not doing it here. You could remove the echo prefix from the line after the first done to enable it.

If you have book names which contain an asterisk or some other shell metacharacter this will be rather more complex -- the title assignment assumes you can use the book title unquoted.

Only the parameter expansion with case conversion is beyond the very basics of Bash. The array operations could perhaps also be a bit scary if you are a complete beginner. Proper understanding of quoting is also often a challenge for newcomers.
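
Note that the ${@,,} and ${arr[@]^} case conversions need Bash 4 or later, and as far as I know the Bash that ships with Mac OS X is still 3.2. A sketch of a replacement that only relies on awk; the function name titlecase_portable is made up.

```bash
#!/bin/bash

# Hypothetical replacement for titlecase: upper-case the first character of
# each white-space separated word and lower-case the rest.
titlecase_portable() {
    printf '%s\n' "$*" |
        awk '{ for (i = 1; i <= NF; i++)
                   $i = toupper(substr($i, 1, 1)) tolower(substr($i, 2))
               print }'
}
```

For example, titlecase_portable THE BROTHER OF DAPHNE prints The Brother Of Daphne. Runs of white space are collapsed to single spaces, because awk rebuilds the line when a field is assigned.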

Upvotes: 1
