a different ben
a different ben

Reputation: 4008

Split text file into parts based on a pattern taken from the text file

I have many text files of fixed-width data, e.g.:

$ head model-q-060.txt 
% x                      y                        
15.0                     0.0                      
15.026087                -1.0                     
15.052174                -2.0                     
15.07826                 -3.0                     
15.104348                -4.0                     
15.130435                -5.0                     
15.156522                -6.0                     
15.182609                -6.9999995               
15.208695                -8.0  

The data comprise 3 or 4 runs of a simulation, all stored in the one text file, with no separator between runs. In other words, there is no empty line or anything, e.g. if there were only 3 'records' per run it would look like this for 3 runs:

$ head model-q-060.txt 
% x                      y                        
15.0                     0.0                      
15.026087                -1.0                     
15.052174                -2.0                     
15.0                     0.0                      
15.038486                -1.0                     
15.066712                -2.0                     
15.0                     0.0                      
15.041089                -1.0                     
15.087612                -2.0                     

It's a COMSOL Multiphysics output file for those interested. Visually you can tell where the new run data begin, as the first x-value is repeated (actually the entire second line might be the same for all of them). So I need to firstly open the file and get this x-value, save it, then use it as a pattern to match with awk or csplit. I am struggling to work this out!

csplit will do the job:

$ csplit -z -f 'temp' -b '%02d.txt' model-q-060.txt /^15\.0\\s/ {*}

but I have to know the pattern to split on. This question is similar but each of my text files might have a different pattern to match: Split files based on file content and pattern matching.

Ben.

Upvotes: 3

Views: 6360

Answers (3)

Jim Garrison
Jim Garrison

Reputation: 86774

Here's a simple awk script that will do what you want:

BEGIN { fn=0 }
NR==1 { next }
NR==2 { delim=$1 }
$1 == delim {
    f=sprintf("test%02d.txt",fn++);
    print "Creating " f
}

{ print $0 > f }
  1. initialize output file number
  2. ignore the first line
  3. extract the delimiter from the second line
  4. for every input line whose first token matches the delimiter, set up the output file name
  5. for all lines, write to the current output file

Upvotes: 3

Blackle Mori
Blackle Mori

Reputation: 1181

If the amount of lines per run is constant, you could use this:

cat your_file.txt | grep -P "^\d" | \
   split --lines=$(expr \( $(wc -l "your_file.txt" | \
   awk '{print $1'}) - 1 \) / number_of_runs)

Upvotes: 0

icyrock.com
icyrock.com

Reputation: 28628

This should do the job - test somewhere you don't have a lot of temp*.txt files: :)

rm -f temp*.txt

cat > f1.txt <<EOF
% x                      y                        
15.0                     0.0                      
15.026087                -1.0                     
15.052174                -2.0                     
15.0                     0.0                      
15.038486                -1.0                     
15.066712                -2.0                     
15.0                     0.0                      
15.041089                -1.0                     
15.087612                -2.0    
EOF

first=`awk 'NR==2{print $1}' f1.txt|sed 's/\\./\\\\./'`
echo --- Splitting by: $first

csplit -z -f temp -b %02d.txt f1.txt /^"$first"\\s/ {*}

for i in temp*.txt; do
  echo ---- $i
  cat $i
done

The output of the above is:

--- Splitting by: 15\.0
51
153
153
136
---- temp00.txt
% x                      y                        
---- temp01.txt
15.0                     0.0                      
15.026087                -1.0                     
15.052174                -2.0                     
---- temp02.txt
15.0                     0.0                      
15.038486                -1.0                     
15.066712                -2.0                     
---- temp03.txt
15.0                     0.0                      
15.041089                -1.0                     
15.087612                -2.0    

Of course, you will run into trouble if you have repeating second column value (15.0 in the above example) - solving that would be a tad harder - exercise left for the reader...

Upvotes: 1

Related Questions