Borderline

Reputation: 53

AWK Split File every n-th Row but group IDs together

Let's assume I have the following file text.txt:

@something
@somethingelse
@anotherthing
1
2
2
3
3
3
4
4
4
5
5
6
7
7
8
9
9
9
10
11
11
11
14
15

I want to split this into multiple files after every 5th data row, but if the next row has the same number it should still end up in the same file. The header should appear in every file, although it could also be dropped and reintroduced later.
The result should look something like this:

text.txt.1
@something
@somethingelse
@anotherthing
1
2
2
3
3
3

text.txt.2
@something
@somethingelse
@anotherthing
4
4
4
5
5

text.txt.3
@something
@somethingelse
@anotherthing
6
7
7
8
9
9
9

text.txt.4
@something
@somethingelse
@anotherthing
10
11
11
11
14

text.txt.5
@something
@somethingelse
@anotherthing
15

So I was thinking about something like this:

awk 'NR%5==1 && $1!=prev{i++;prev=$1}{print > FILENAME"."i}' text.txt

Each condition works on its own, but not when combined. Is this possible using awk?

Upvotes: 5

Views: 468

Answers (3)

Tyl

Reputation: 5252

Nice question.
With your example, this would work:

awk 'BEGIN{i=1;}/^@/{header= header == ""? $0 : header "\n" $0; next}c>=5 && $1!=prev{i++;c=0;}{if(!c) print header>FILENAME"."i; print > FILENAME"."i;c++;prev=$1;}' text.txt

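You need to strip the header out and keep your own counter (c above). NR is just the line number of the current input record: it also counts the header lines and never resets, so NR%5 stops lining up with the file boundaries as soon as a file ends up with more than 5 data rows.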
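To make the difference between NR and the counter visible, here is a small trace sketch. It prints NR, the row value and c for every data row of a shortened sample; the file name sample.txt and the trace layout are assumptions made for illustration only.

```shell
# Shortened sample with three header lines (hypothetical file name).
tmp=$(mktemp -d)
printf '%s\n' @something @somethingelse @anotherthing 1 2 2 3 3 3 4 4 4 5 5 > "$tmp/sample.txt"

# Print "NR value c" for each data row.
# c restarts only when at least 5 rows were written AND the ID changes.
trace=$(awk '
  /^@/ { next }                        # header lines are not counted
  c >= 5 && $1 != prev { c = 0 }       # a new output file would start here
  { c++; prev = $1; print NR, $1, c }
' "$tmp/sample.txt")
printf '%s\n' "$trace"
rm -rf "$tmp"
```

In the trace, c drops back to 1 at the row with NR 10 (the first 4), which is not a row where NR%5==1, so a test on NR alone would split in the wrong place.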

Broken up and slightly improved:

awk 'BEGIN{i=1;}
  /^@/{header= header == ""? $0 : header ORS $0; next}
  c>=5 && $1!=prev{i++;c=0;}
  !c {print header>FILENAME"."i;}
  {print > FILENAME"."i;c++;prev=$1;}
  ' text.txt

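To solve the potential problems mentioned in the comment, close each output file before opening the next one and build the output name in a variable instead of using an unparenthesized expression after >: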

awk 'BEGIN{i=1}
  /^@/{header= header == ""? $0 : header ORS $0; next}
  c>=5 && $1!=prev{i++;c=0}
  !c {close(f);f=(FILENAME"."i);print header>f}
  {print>f;c++;prev=$1}
  ' text.txt

Alternatively, see Ed's answer, which is more precise and portable across platforms and awk versions.
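A quick way to check the last version against the sample from the question is sketched below. The scratch directory, the anchored /^@/ header test and the line-count check are assumptions added for illustration, not part of the answer.

```shell
# Recreate the sample from the question in a scratch directory.
tmp=$(mktemp -d)
printf '%s\n' @something @somethingelse @anotherthing \
  1 2 2 3 3 3 4 4 4 5 5 6 7 7 8 9 9 9 10 11 11 11 14 15 > "$tmp/text.txt"

# Same logic as the last version above, with the header test written as /^@/.
awk 'BEGIN{i=1}
  /^@/{header= header == ""? $0 : header ORS $0; next}
  c>=5 && $1!=prev{i++;c=0}
  !c {close(f);f=(FILENAME"."i);print header>f}   # only one output file open at a time
  {print>f;c++;prev=$1}
  ' "$tmp/text.txt"

# Line count of each output file: 3 header lines plus its data rows.
counts=$(for f in "$tmp"/text.txt.?; do awk 'END{print NR}' "$f"; done | paste -sd' ' -)
echo "$counts"
rm -rf "$tmp"
```

For this sample the five files should hold 6, 5, 7, 5 and 1 data rows, so the counts come out as 9 8 10 8 4.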

Upvotes: 5

Ed Morton

Reputation: 203684

Using any awk in any shell on every Unix box:

$ cat tst.awk
/^@/ {
    hdr = hdr $0 ORS
    next
}
( (++numLines) % 5 ) == 1 {
    if ( $0 == prev ) {
        --numLines
    }
    else {
        close(out)
        out = FILENAME "." (++numBlocks)
        printf "%s", hdr > out
        numLines = 1
    }
}
{
    print > out
    prev = $0
}

$ awk -f tst.awk text.txt

$ head text.txt.*
==> text.txt.1 <==
@something
@somethingelse
@anotherthing
1
2
2
3
3
3

==> text.txt.2 <==
@something
@somethingelse
@anotherthing
4
4
4
5
5

==> text.txt.3 <==
@something
@somethingelse
@anotherthing
6
7
7
8
9
9
9

==> text.txt.4 <==
@something
@somethingelse
@anotherthing
10
11
11
11
14

==> text.txt.5 <==
@something
@somethingelse
@anotherthing
15
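
The block size of 5 is hard-coded in tst.awk. Below is a minimal sketch of how it might be passed in instead. The variable n, the guard on an empty out and the tiny input file are assumptions for illustration and are not part of the original script. The boundary test is rewritten with a post-increment so that n=1 also works.

```shell
# Tiny sample: one header line and a run of 2s that crosses the boundary.
tmp=$(mktemp -d)
printf '%s\n' @hdr 1 2 2 2 3 3 4 > "$tmp/in.txt"

awk -v n=3 '
/^@/ { hdr = hdr $0 ORS; next }            # collect header lines
( numLines++ % n ) == 0 {                  # a block boundary is due
    if ( out != "" && $0 == prev ) {
        --numLines                         # same ID: postpone the boundary
    }
    else {
        close(out)
        out = FILENAME "." (++numBlocks)   # start the next output file
        printf "%s", hdr > out
        numLines = 1
    }
}
{ print > out; prev = $0 }
' "$tmp/in.txt"

# Line count of each output file (1 header line plus its data rows).
counts=$(for f in "$tmp"/in.txt.?; do awk 'END{print NR}' "$f"; done | paste -sd' ' -)
second=$(paste -sd' ' "$tmp/in.txt.2")
echo "$counts"
rm -rf "$tmp"
```

With n=3 the third 2 stays in the first file, so in.txt.1 holds 1 2 2 2 and in.txt.2 holds 3 3 4, each preceded by the header.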

Upvotes: 4

RavinderSingh13

Reputation: 133538

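With your shown samples, please try the following awk program. It was written and tested in GNU awk and relies on the gawk-specific PROCINFO["sorted_in"], so other awks will not visit the groups in a guaranteed order.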

awk '
BEGIN{
  outFile="test.txt"
  count=1
}
/@/{                                       # collect header lines
  header=(header?header ORS:"")$0
  next
}
{                                          # group identical rows under one key
  arr[$0]=(arr[$0]?arr[$0] ORS:"")$0
}
END{
  PROCINFO["sorted_in"] = "@ind_num_asc"   # GNU awk only: walk keys in numeric order
  for(i in arr){
    if(len==0){                            # first group of a file: write the header
      print header > (outFile count)
    }
    num=split(arr[i],arr2,"\n")            # num = number of rows in this group
    print arr[i] > (outFile count)
    len+=num
    if(len>=5){                            # file is full: close it and move on
      close(outFile count)
      count++
      len=0
    }
  }
}
'  Input_file

Upvotes: 4
