Reputation: 29
I have a big file in which the third field ($3) of each line is a time value.
I want to split the file into several files, each containing the lines that fall within one time interval. The number of lines can vary from one file to another.
Example
Input file:
$xx_ at 0.0 "$elt_(0) coordinates 656.02 1819.19 0.00"
$xx_ at 1.0 "$elt_(0) coordinates 654.99 1818.19 1.44"
$xx_ at 1.0 "$elt_(1) coordinates 365.41 1284.31 0.00"
$xx_ at 4.0 "$elt_(0) coordinates 652.74 1816.04 3.12"
$xx_ at 4.0 "$elt_(1) coordinates 365.7 1281.79 2.54"
$xx_ at 5.0 "$elt_(0) coordinates 649.08 1812.52 5.08"
$xx_ at 5.0 "$elt_(1) coordinates 366.2 1277.44 4.37"
$xx_ at 8.0 "$elt_(0) coordinates 643.59 1807.23 7.62"
$xx_ at 8.0 "$elt_(1) coordinates 366.88 1271.47 6.01"
$xx_ at 10.0 "$elt_(0) coordinates 636.46 1800.37 9.90"
$xx_ at 10.0 "$elt_(1) coordinates 367.78 1263.63 7.90"
If I want to split by an interval of 5 seconds, I will have 3 files:
file1:
$xx_ at 0.0 "$elt_(0) coordinates 656.02 1819.19 0.00"
$xx_ at 1.0 "$elt_(0) coordinates 654.99 1818.19 1.44"
$xx_ at 1.0 "$elt_(1) coordinates 365.41 1284.31 0.00"
$xx_ at 4.0 "$elt_(0) coordinates 652.74 1816.04 3.12"
$xx_ at 4.0 "$elt_(1) coordinates 365.7 1281.79 2.54"
$xx_ at 5.0 "$elt_(0) coordinates 649.08 1812.52 5.08"
$xx_ at 5.0 "$elt_(1) coordinates 366.2 1277.44 4.37"
file5:
$xx_ at 8.0 "$elt_(0) coordinates 643.59 1807.23 7.62"
$xx_ at 8.0 "$elt_(1) coordinates 366.88 1271.47 6.01"
$xx_ at 10.0 "$elt_(0) coordinates 636.46 1800.37 9.90"
$xx_ at 10.0 "$elt_(1) coordinates 367.78 1263.63 7.90"
file10:
$xx_ at 13.0 "$elt_(1) coordinates 380.78 1279.63 7.90"
Also, within each file, I want to keep each element only once (its last occurrence), and from that line I want to keep only the index of the element and the two numeric fields that come right after "coordinates":
file1:
0 649.08 1812.52
1 366.2 1277.44
Update: I tried to combine the two answers I received to get what I need:
awk 'BEGIN{n=1}{x=$3;if(x>n*5){++n}{print > "file" n*5}}' file
for (i in file){awk 'BEGIN{}{if(($3+0)>max[$1])
{max[$1]=$3; line[$1]=$0}}END{for(i in line)
{print line[i];}}' file[i]}
The second part (taken from the proposed uniq.awk), when tried on a single file, gives me only one line instead of one line per element.
Moreover, the for loop gives me an error, although this is all I added for it:
for (i in file){}
Upvotes: 1
Views: 235
Reputation: 1239
I wrote two awk scripts that accomplish this when used together. Invoke the first one (testsort.awk) like this:
./testsort.awk test.txt
where test.txt is the input file. The script prints some diagnostics to standard output; the real output goes to the files named file0, file5, and so on.
testsort.awk calls uniq.awk internally (both are included below).
testsort.awk:
#! /bin/gawk -f
BEGIN{max=0;}{
    #use an array to map time values to first column value lists
    if($3 in arr){
        arr[$3]=arr[$3]" "$1;
    }else{
        arr[$3]=$1;
    }
    #use another array to store the whole line
    arr2[$3"_"$1]=$0;
    #keep track of the maximum time observed
    if(($3+0)>max){
        max=($3+0);
    }
}
END{
    #sort them into their files starting at zero
    for(i=0;i<max;i+=5){
        for(j in arr){
            split(arr[j],a," ")
            for(k in a){
                idx=j"_"a[k];
                num=(j+0);
                if(num>i && num<=i+5){
                    output["file"i]=output["file"i]arr2[idx]"\n"
                }
            }
        }
    }
    #write the appropriate files
    for(i in output){
        print i;
        print output[i];
        if(length(output[i])>0){
            system("echo \""output[i]"\" |./uniq.awk|sort >"i);
        }
    }
}
uniq.awk:
#! /bin/gawk -f
BEGIN{}{
    #find the maxes
    if(($3+0)>max[$1]){
        max[$1]=$3
        line[$1]=$0
    }
}
END{
    #write the appropriate files
    for(i in line){
        print line[i];
    }
}
The solution also depends on the shell utility sort being available.
EDIT:
The specification of the input file was changed in the post, so now I would first run
$ sed -e 's/[$]//g' < test.txt > test_new.txt
to get rid of the annoying dollar signs in the original input, and then
$ ./testsort_new.awk test_new.txt
New file testsort_new.awk:
#! /usr/bin/awk -f
BEGIN{max=0;}{
    #use an array to map time values to first column value lists
    if($3 in arr){
        arr[$3]=arr[$3]" "$4;
    }else{
        arr[$3]=$4;
    }
    #use another array to store the whole line
    arr2[$3"_"$4]=$0;
    #keep track of the maximum time observed
    if(($3+0)>max){
        max=($3+0);
    }
}
END{
    #sort them into their files starting at zero
    for(i=0;i<max;i+=5){
        for(j in arr){
            split(arr[j],a," ")
            for(k in a){
                idx=j"_"a[k];
                num=(j+0);
                if(num>=i && num<i+5+1){
                    output["file"i]=output["file"i]arr2[idx]"\n"
                }
            }
        }
    }
    #write the appropriate files
    for(i in output){
        print i;
        print output[i];
        if(length(output[i])>0){
            target=output[i];
            gsub("\"","\\\"",target);
            system("echo \""target"\" |./uniq_new.awk|sort -k4 >"i);
        }
    }
}
New file uniq_new.awk:
#! /bin/awk -f
BEGIN{}{
    #find the maxes
    if(($3+0)>max[$4]){
        max[$4]=$3
        line[$4]=$0
    }
}
END{
    #write the appropriate files
    for(i in line){
        print line[i];
    }
}
The dollar signs will not be reproduced in the output.
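As a point of comparison, the whole task (splitting by interval, keeping the last entry per element, and trimming each line to the index plus two coordinates) can probably also be done in a single awk pass, without sed, sort, or a helper script. The following is only a sketch under some assumptions: the input is sorted by time, the intervals are 0 to 5, then above 5 up to 10, and so on (as in the question's example), and the output names out1, out2, ... are made up for illustration.

```shell
#!/bin/sh
# Work in a scratch directory so the generated files stay out of the way.
cd "$(mktemp -d)" || exit 1

# Sample input copied from the question.
cat > test.txt <<'EOF'
$xx_ at 0.0 "$elt_(0) coordinates 656.02 1819.19 0.00"
$xx_ at 1.0 "$elt_(0) coordinates 654.99 1818.19 1.44"
$xx_ at 1.0 "$elt_(1) coordinates 365.41 1284.31 0.00"
$xx_ at 4.0 "$elt_(0) coordinates 652.74 1816.04 3.12"
$xx_ at 4.0 "$elt_(1) coordinates 365.7 1281.79 2.54"
$xx_ at 5.0 "$elt_(0) coordinates 649.08 1812.52 5.08"
$xx_ at 5.0 "$elt_(1) coordinates 366.2 1277.44 4.37"
$xx_ at 8.0 "$elt_(0) coordinates 643.59 1807.23 7.62"
$xx_ at 8.0 "$elt_(1) coordinates 366.88 1271.47 6.01"
$xx_ at 10.0 "$elt_(0) coordinates 636.46 1800.37 9.90"
$xx_ at 10.0 "$elt_(1) coordinates 367.78 1263.63 7.90"
EOF

awk '
{
    # Interval number: times 0..5 -> 1, above 5 up to 10 -> 2, and so on.
    n = int($3 / 5)
    if (n * 5 < $3) n++
    if (n == 0) n = 1

    # Pull the element index out of a field such as "$elt_(0); skip other lines.
    if (!match($4, /\([0-9]+\)/)) next
    id = substr($4, RSTART + 1, RLENGTH - 2)

    # Remember the order of first appearance; later lines overwrite earlier data.
    if (!((n, id) in last)) order[n, ++count[n]] = id
    last[n, id] = id " " $6 " " $7
    if (n > maxn) maxn = n
}
END {
    # One output file per non-empty interval.
    for (n = 1; n <= maxn; n++) {
        for (i = 1; i <= count[n]; i++)
            print last[n, order[n, i]] > ("out" n)
        close("out" n)
    }
}' test.txt

cat out1
# 0 649.08 1812.52
# 1 366.2 1277.44
```

Keying the arrays on the element index rather than on $1 matters here: in the question's input $1 is the same string on every line, so an array keyed on $1 can only ever hold one entry.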
Upvotes: 1
Reputation: 71
I can't work out the exact requirement from the input, but try the following:
awk 'BEGIN{n=1}{x=$3;if(x>n*5){++n}{print > "file" n}}' file
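One caveat with this one-liner: the if advances n by at most one per line, so the first line after an empty interval lands in the wrong file. A minimal sketch of a variant that uses while instead, assuming the input is sorted by time; sample.txt and its contents are made up for illustration:

```shell
#!/bin/sh
# Work in a scratch directory so the generated files stay out of the way.
cd "$(mktemp -d)" || exit 1

# Made-up sample with a gap: no line has a time above 5 and up to 10.
cat > sample.txt <<'EOF'
$xx_ at 1.0 "$elt_(0) coordinates 654.99 1818.19 1.44"
$xx_ at 4.0 "$elt_(0) coordinates 652.74 1816.04 3.12"
$xx_ at 12.0 "$elt_(0) coordinates 636.46 1800.37 9.90"
$xx_ at 13.0 "$elt_(1) coordinates 367.78 1263.63 7.90"
EOF

# "while" keeps advancing n until the time in $3 fits into interval n.
awk 'BEGIN { n = 1 } { while ($3 > n * 5) ++n; print > ("file" n) }' sample.txt

ls file*    # lists file1 and file3; there is no file2 because that interval is empty
```

With the original if, the 12.0 line would be written to file2 even though it belongs to the third interval.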
Upvotes: 0