user971102

Reputation: 3075

Split files according to a field and save in subdirectory created using the root name

I am having trouble with several bits of code. I am no expert in Bash programming, so I have spent all day trying unsuccessfully to find something that works for my task, and I was hoping you could point me in the right direction.

I have many large files that I would like to split according to the third field within each of them. I would like to keep the header in each of the sub-files and save the created sub-files in new directories named after the root names of the files.

The initial files stored in the original directory are:

Downloads/directory1/Levels_CHG_Lab_S_sample1.txt
Downloads/directory1/Levels_CHG_Lab_S_sample2.txt
Downloads/directory1/Levels_CHG_Lab_S_sample3.txt

and so on.

Each of these files has 200 columns, and column 3 contains values from 1 through 10. I would like to split each of the files above based on the value of this column and store the sub-files in sub-folders. For example, the sub-folder "Downloads/directory1/sample1" will contain 10 files (each with the header line) derived by splitting the file Downloads/directory1/Levels_CHG_Lab_S_sample1.txt.

I have now tried many different approaches for these steps, with no success. I must be making this more complicated than it is, since the code I have tried looks awful. Here is the code I am trying to work from:

FILES=Downloads/directory1/

for f in $FILES
  do
    # Create folder with root name by stripping file names
    fname=${echo $f | sed 's/.txt//;s/Levels_CHG_Lab_S_//'}
    echo "Creating sub-directory [$fname]"
    mkdir "$fname"

    # Save the header
    awk 'NR==1{print $0}' $f > header

    # Split each file by third column
    echo "Splitting file $f"
    awk  'NR>1  {print $0 > $3".txt" }' $f

    # Move newly created files in sub directory
    mv {1..10}.txt $fname  # I have no idea how to do specify the files just created

    # Loop through the sub-files to attach header row:
    for subfile in $fname
      do
       cat header $subfile >> tmp_file
       mv -f tmp_file $subfile
      done
done

All these steps seem very complicated to me. I would very much appreciate it if you could help me solve this the right way. Thank you very much for your help. -fra

Upvotes: 0

Views: 932

Answers (1)

ebarrere

Reputation: 221

You have a few problems with your code right now. First of all, at no point do you list the contents of your downloads directory. You are simply setting the FILES variable to a string that is the path to that directory. You would need something like:

FILES=$(ls Downloads/directory1/*.txt)

You also never cd to the Downloads/directory1 folder, so your mkdir would create directories in cwd; probably not what you want.
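One way to sidestep the cd question is to build each sub-directory path from the input file's own location, so the result does not depend on the current working directory. The following is only a sketch under assumptions: the scratch directory and the two empty sample files are made up for the demo, and the prefix `Levels_CHG_Lab_S_` is taken from the file names in the question.

```shell
# Demo layout in a scratch directory (assumed; mirrors the question's paths)
demo=$(mktemp -d)
src="$demo/Downloads/directory1"
mkdir -p "$src"
touch "$src/Levels_CHG_Lab_S_sample1.txt" "$src/Levels_CHG_Lab_S_sample2.txt"

for f in "$src"/*.txt; do
    name=$(basename "$f" .txt)          # e.g. Levels_CHG_Lab_S_sample1
    name=${name#Levels_CHG_Lab_S_}      # strip the common prefix, e.g. sample1
    outdir="$(dirname "$f")/$name"      # place it next to the input file
    mkdir -p "$outdir"                  # -p: no error if it already exists
    echo "$f -> $outdir"
done
```

Because `outdir` is derived from `$f`, the directories land under Downloads/directory1 no matter where the script is started from.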

If you know that the numbers in column 3 always range from 1 to 10, I would just pre-populate those files with the header line before you split the file.

Try this code to do what you want (untested):

BASEDIR=Downloads/directory1

# A glob in the for loop lists the files directly, so there is no need to parse ls
for f in "$BASEDIR"/*.txt; do
    # Create folder with root name by stripping the extension and the common prefix
    name=$(basename "$f" .txt)
    outdir="$BASEDIR/${name#Levels_CHG_Lab_S_}"
    echo "Creating sub-directory [$outdir]"
    mkdir -p "$outdir"

    # Save the header to each file (quoted so that tabs in the header are preserved)
    HEADER_LINE=$(head -n1 "$f")
    for i in {1..10}; do
      printf '%s\n' "$HEADER_LINE" > "$outdir/$i.txt"
    done

    # Split each file by third column, appending below the header written above
    echo "Splitting file $f"
    awk -v outdir="$outdir" 'NR>1 {filename = outdir "/" $3 ".txt"; print $0 >> filename }' "$f"
done
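If the values in column 3 are not known in advance, a variant is to let awk write the header the first time it meets each value, instead of pre-populating files 1 through 10. This is a sketch rather than a tested drop-in replacement: the three-row sample file and the scratch directory are invented for the demo, and it relies on awk's default whitespace field splitting (pass `-F'\t'` if the files are tab-separated and fields may contain spaces).

```shell
# Made-up sample input in a scratch directory (assumption for the demo)
demo=$(mktemp -d)
f="$demo/Levels_CHG_Lab_S_sample1.txt"
outdir="$demo/sample1"
mkdir -p "$outdir"
printf 'id\tname\tgroup\n' >  "$f"
printf 'a\tx\t1\n'         >> "$f"
printf 'b\ty\t2\n'         >> "$f"
printf 'c\tz\t1\n'         >> "$f"

awk -v outdir="$outdir" '
    NR == 1 { header = $0; next }      # remember the header, do not route it
    {
        out = outdir "/" $3 ".txt"
        if (!(out in seen)) {          # first row seen for this value
            print header > out         # ">" truncates only when the file is first opened
            seen[out] = 1
        }
        print > out                    # later prints append while the file stays open
    }
' "$f"

ls "$outdir"
```

With only ten distinct values this keeps at most ten files open. With many more distinct values, some awk implementations can run out of open files, in which case closing each file with `close(out)` after writing and switching to `>>` is a common workaround.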

Upvotes: 1
