student beginner

Reputation: 29

Split a file by a variable range of lines

I have a big file in which the third field ($3) of each line is a value representing time.

I want to split my file into several files, each containing the lines that fall within one time interval. The number of lines can vary from one file to another.

Example

Input file:

$xx_ at 0.0 "$elt_(0) coordinates 656.02 1819.19 0.00"
$xx_ at 1.0 "$elt_(0) coordinates 654.99 1818.19 1.44"
$xx_ at 1.0 "$elt_(1) coordinates 365.41 1284.31 0.00"
$xx_ at 4.0 "$elt_(0) coordinates 652.74 1816.04 3.12"
$xx_ at 4.0 "$elt_(1) coordinates 365.7 1281.79 2.54"
$xx_ at 5.0 "$elt_(0) coordinates 649.08 1812.52 5.08"
$xx_ at 5.0 "$elt_(1) coordinates 366.2 1277.44 4.37"
$xx_ at 8.0 "$elt_(0) coordinates 643.59 1807.23 7.62"
$xx_ at 8.0 "$elt_(1) coordinates 366.88 1271.47 6.01"
$xx_ at 10.0 "$elt_(0) coordinates 636.46 1800.37 9.90"
$xx_ at 10.0 "$elt_(1) coordinates 367.78 1263.63 7.90"

If I want to split by an interval of 5 seconds, I will have 3 files:

file1:

$xx_ at 0.0 "$elt_(0) coordinates 656.02 1819.19 0.00"
$xx_ at 1.0 "$elt_(0) coordinates 654.99 1818.19 1.44"
$xx_ at 1.0 "$elt_(1) coordinates 365.41 1284.31 0.00"
$xx_ at 4.0 "$elt_(0) coordinates 652.74 1816.04 3.12"
$xx_ at 4.0 "$elt_(1) coordinates 365.7 1281.79 2.54"
$xx_ at 5.0 "$elt_(0) coordinates 649.08 1812.52 5.08"
$xx_ at 5.0 "$elt_(1) coordinates 366.2 1277.44 4.37"

file5:

$xx_ at 8.0 "$elt_(0) coordinates 643.59 1807.23 7.62"
$xx_ at 8.0 "$elt_(1) coordinates 366.88 1271.47 6.01"
$xx_ at 10.0 "$elt_(0) coordinates 636.46 1800.37 9.90"
$xx_ at 10.0 "$elt_(1) coordinates 367.78 1263.63 7.90"

file10:

$xx_ at 13.0 "$elt_(1) coordinates 380.78 1279.63 7.90"

Also, within each file, I want to keep each element only once (its last occurrence), and keep only the element's index and the two numeric fields right after coordinates:

file1:

0 649.08 1812.52 
1 366.2 1277.44 

Update: I tried to combine the two answers I got:

awk 'BEGIN{n=1}{x=$3;if(x>n*5){++n}{print > "file" n*5}}' file

for (i in file){awk 'BEGIN{}{if(($3+0)>max[$1])
{max[$1]=$3; line[$1]=$0}}END{for(i in line)
{print line[i];}}' file[i]}

The second part (taken from the proposed uniq.awk), when tried on a single file, gives me only one line instead of all the unique lines.

Moreover, the for loop gives me an error, although this is all I added for it:

for (i in file){}

Upvotes: 1

Views: 235

Answers (2)

villaa

Reputation: 1239

I wrote two awk scripts that accomplish this when used together. Invoke the first one (testsort.awk) like this:

./testsort.awk test.txt

where test.txt is the input file. The script prints some diagnostics; the real output is in the files named file0, file5, etc.

testsort.awk calls uniq.awk internally, so both scripts (included below) must be executable and sit in the current directory.

testsort.awk:

#! /bin/gawk -f

BEGIN{max=0;}{

  #use an array to map time values to first column value lists
  if($3 in arr){
    arr[$3]=arr[$3]" "$1;
  }else{
    arr[$3]=$1;
  }

  #use another array to store the whole line
  arr2[$3"_"$1]=$0;

  #keep track of the maximum time observed
  if(($3+0)>max){
    max=($3+0);
  }
}
END{

  #sort them into their files starting at zero
  for(i=0;i<max;i+=5){
    for(j in arr){
      split(arr[j],a," ")
      for(k in a){
        idx=j"_"a[k];
        num=(j+0);
        if(num>i && num<=i+5){
          output["file"i]=output["file"i]arr2[idx]"\n"
        }
      }
    }
  }

  #write the appropriate files
  for(i in output){
    print i;
    print output[i];
    if(length(output[i])>0){
      system("echo \""output[i]"\" |./uniq.awk|sort >"i);
    }
  }
}

uniq.awk:

#! /bin/gawk -f

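BEGIN{}{

  #keep, for each key in column 1, the line with the largest time;
  #the "in" test lets a first line with time 0 through, and ">=" makes later ties win
  if(!($1 in max) || ($3+0)>=max[$1]){
    max[$1]=$3+0
    line[$1]=$0
  }

}
END{

  #print the surviving lines (order is unspecified; the caller pipes them to sort)
  for(i in line){
    print line[i];
  }
}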

The solution also depends on having the shell utility sort.

EDIT:
The specification of the input file was changed in the question, so now I would do:

  1. $ sed -e 's/[$]//g' < test.txt > test_new.txt (to get rid of the annoying dollar signs in the original input)

  2. $ ./testsort_new.awk test_new.txt
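
Step 1 can be checked on its own before running the script. A small sketch, where sample.txt and sample_new.txt are made-up file names used only for illustration:

```shell
# One line in the question's format, written literally (single quotes stop
# the shell from expanding the dollar signs).
printf '%s\n' '$xx_ at 0.0 "$elt_(0) coordinates 656.02 1819.19 0.00"' > sample.txt

# Inside a bracket expression, $ is an ordinary character, so this deletes every dollar sign.
sed -e 's/[$]//g' < sample.txt > sample_new.txt

cat sample_new.txt
```

The field positions are unchanged by this step, so $3 is still the time and $4 still identifies the element.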

new file testsort_new.awk:

#! /usr/bin/awk -f

BEGIN{max=0;}{

  #use an array to map time values to lists of fourth-column (element) values
  if($3 in arr){
    arr[$3]=arr[$3]" "$4;
  }else{
    arr[$3]=$4;
  }

  #use another array to store the whole line
  arr2[$3"_"$4]=$0;

  #keep track of the maximum time observed
  if(($3+0)>max){
    max=($3+0);
  }
}
END{

  #sort them into their files starting at zero
  for(i=0;i<max;i+=5){
    for(j in arr){
      split(arr[j],a," ")
      for(k in a){
        idx=j"_"a[k];
        num=(j+0);
        if(num>=i && num<i+5+1){
          output["file"i]=output["file"i]arr2[idx]"\n"
        }
      }
    }
  }

  #write the appropriate files
  for(i in output){
    print i;
    print output[i];
    if(length(output[i])>0){
      target=output[i];
      gsub("\"","\\\"",target);
      system("echo \""target"\" |./uniq_new.awk|sort -k4 >"i);
    }
  }
}

new file uniq_new.awk:

#! /bin/awk -f

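BEGIN{}{

  #keep, for each element in column 4, the line with the largest time;
  #the "in" test lets a first line with time 0 through, and ">=" makes later ties win
  if(!($4 in max) || ($3+0)>=max[$4]){
    max[$4]=$3+0
    line[$4]=$0
  }

}
END{

  #print the surviving lines (order is unspecified; the caller pipes them to sort)
  for(i in line){
    print line[i];
  }
}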

The dollar signs will not be reproduced in the output.
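
For comparison, the same result can probably be reached in a single awk pass, without the sed step or the helper script. This is only a sketch under several assumptions: the input is already sorted by time (as in the sample), each output file covers times up to and including its upper bound (so 5.0 lands in the first file, as in the question's example), and the element index is the only run of digits in the fourth field. The names input.txt and out1, out2, ... are made up for illustration.

```shell
# The sample input from the question; the quoted heredoc keeps the dollar signs literal.
cat > input.txt <<'EOF'
$xx_ at 0.0 "$elt_(0) coordinates 656.02 1819.19 0.00"
$xx_ at 1.0 "$elt_(0) coordinates 654.99 1818.19 1.44"
$xx_ at 1.0 "$elt_(1) coordinates 365.41 1284.31 0.00"
$xx_ at 4.0 "$elt_(0) coordinates 652.74 1816.04 3.12"
$xx_ at 4.0 "$elt_(1) coordinates 365.7 1281.79 2.54"
$xx_ at 5.0 "$elt_(0) coordinates 649.08 1812.52 5.08"
$xx_ at 5.0 "$elt_(1) coordinates 366.2 1277.44 4.37"
$xx_ at 8.0 "$elt_(0) coordinates 643.59 1807.23 7.62"
$xx_ at 8.0 "$elt_(1) coordinates 366.88 1271.47 6.01"
$xx_ at 10.0 "$elt_(0) coordinates 636.46 1800.37 9.90"
$xx_ at 10.0 "$elt_(1) coordinates 367.78 1263.63 7.90"
EOF

awk -v step=5 '
{
  # Bucket number: round time/step up, with time 0 going into bucket 1.
  n = int($3 / step)
  if (n * step < $3) n++
  if (n < 1) n = 1

  # Keep only the digits of the fourth field as the element index.
  id = $4
  gsub(/[^0-9]/, "", id)

  # Later lines overwrite earlier ones, so the last occurrence survives.
  last[n, id] = id " " $6 " " $7
}
END {
  # Write one file per bucket; the order of lines within a file is unspecified.
  for (key in last) {
    split(key, part, SUBSEP)
    print last[key] > ("out" part[1])
  }
}' input.txt
```

Because awk does not guarantee the traversal order of `for (key in last)`, pipe each output file through sort if the order of elements matters.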

Upvotes: 1

Cron

Reputation: 71

I can't work out the exact requirement from the input. Try the command below.

awk 'BEGIN{n=1}{x=$3;if(x>n*5){++n}{print > "file" n}}' file
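
Note that the one-liner above advances n by at most one per input line, so if the timestamps ever jump by more than one interval, the following lines end up in the wrong file. A hedged variant that computes the file number directly from the time, assuming the third field is numeric and each file covers times up to and including its upper bound (gap.txt and the bucket prefix are made-up names):

```shell
# Times 0 and 13 with nothing in between: a gap wider than one 5-second interval.
cat > gap.txt <<'EOF'
$xx_ at 0.0 "$elt_(0) coordinates 656.02 1819.19 0.00"
$xx_ at 13.0 "$elt_(1) coordinates 380.78 1279.63 7.90"
EOF

awk -v step=5 '{
  n = int($3 / step)          # whole intervals below the time
  if (n * step < $3) n++      # round up when the time is past the boundary
  if (n < 1) n = 1            # time 0 goes into the first file
  print > ("bucket" n)
}' gap.txt
```

Here the line at 13.0 goes to bucket3; no bucket2 is created because no line falls in that interval.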

Upvotes: 0
