Reputation: 477

Extracting several rows with overlap using awk

I have a big file that looks like this (it actually has 12368 rows):

 Header
 175566717.000
 175570730.000
 175590376.000
 175591966.000
 175608932.000
 175612924.000
 175614836.000
 .
 .
 .
 175680016.000
 175689679.000
 175695803.000
 175696330.000

What I want to do is delete the header, then extract the first 2000 lines (lines 1 to 2000), then lines 1500 to 3500, then lines 3000 to 5000, and so on. In other words: extract windows of 2000 lines, with an overlap of 500 lines between consecutive windows, until the end of the file.

From a previous post, I got this:

tail -n +2 myfile.txt | awk 'BEGIN{file=1} ++count && count==2000 {print > "window"file; file++; count=500} {print > "window"file}'

But that isn't what I want: I don't get the 500-line overlap, and my first window has 1999 rows instead of 2000.

Any help would be appreciated.

Upvotes: 1

Answers (3)

William Pursell

Reputation: 212544

Reading the entire file into memory is usually not a great idea, and in this case is not necessary. Given a line number, you can easily compute which files it should go into. For example:

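tail -n +2 myfile.txt | awk '{
        # window f (0-based) holds lines f*(t-d)+1 through f*(t-d)+t
        a = int( NR / (t-d));        # highest window that can contain line NR
        b = int( (NR-t) / (t-d));    # lowest candidate window (negative near the start)
        for( f = b; f<=a; f++ ) {
            if( f >= 0 && (f * (t-d)) < NR  &&  ( NR <= f *(t-d) + t))
                print > ("window"(f+1))
        }
    }' t=2000 d=500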
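
The same arithmetic can be sketched in Python to check which windows a given line lands in. This is only an illustration: `windows_for_line` is a hypothetical helper name, it assumes data lines are numbered from 1 with the header already dropped, and it assumes 0 <= d < t. It does not know the file length, so it can name a window that would run past the end of the data.

```python
def windows_for_line(n, t=2000, d=500):
    """Return the 1-based window numbers that data line n belongs to.

    Window f (0-based) is assumed to hold lines f*(t-d)+1 .. f*(t-d)+t.
    """
    step = t - d
    # smallest f with n <= f*step + t, i.e. ceil((n - t) / step), but never below 0
    first = max(0, -(-(n - t) // step))
    # largest f with f*step < n
    last = (n - 1) // step
    return [f + 1 for f in range(first, last + 1)]


if __name__ == "__main__":
    for n in (1, 1500, 1501, 2000, 2001, 3001):
        print(n, windows_for_line(n))
```

Lines inside an overlap region come back with two window numbers, which is why the awk loop above may print the same line to more than one file.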

Upvotes: 0

Jacob Stevenson

Reputation: 3756

awk is not ideal for this. In Python you could do something like:

with open("data") as fin:
    lines = fin.readlines()

# remove header
lines = lines[1:]

window = 2000   # lines per window
overlap = 500   # lines shared by consecutive windows

# print the lines
i = 0
while True:
    if len(lines) < i + window:
        # we're done.  whatever is left in the file will be ignored
        break
    print("\n starting window")
    for line in lines[i:i + window]:
        print(line.rstrip("\n"))  # remove \n

    i += window - overlap
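
If the shorter window at the end of the file should be kept rather than ignored, a generator is one way to sketch it. The names `sliding_windows` and `write_windows` are hypothetical, and the `window1`, `window2`, ... output names simply mirror the ones used in the question. The sketch assumes the whole file fits in memory and stops as soon as a window reaches the last line.

```python
def sliding_windows(lines, size=2000, overlap=500):
    """Yield consecutive windows of `size` items sharing `overlap` items."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    start = 0
    while start < len(lines):
        yield lines[start:start + size]
        if start + size >= len(lines):
            break  # this window already reached the end of the data
        start += step


def write_windows(path, prefix="window", size=2000, overlap=500):
    """Write each window of the file at `path` to prefix1, prefix2, ..."""
    with open(path) as fin:
        lines = fin.readlines()[1:]  # drop the header line
    windows = sliding_windows(lines, size, overlap)
    for number, window in enumerate(windows, start=1):
        with open(f"{prefix}{number}", "w") as fout:
            fout.writelines(window)
```

Calling `write_windows("data")` would then produce one file per window. Stopping once a window touches the end is a design choice: it avoids a trailing file that only repeats lines already present in the previous window.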

Upvotes: 0

Kent

Reputation: 195229

 awk -v i=1 -v t=2000 -v d=500 'NR>1{a[NR-1]=$0}
END{while(i<NR-1){for(k=i;k<i+t&&k<NR;k++)print a[k] > (i".txt"); close(i".txt");i=i+t-d}}' file

Try the line above. You can change the numbers to fit new requirements, and you can choose your own file names too.

A little test with t=10 (your 2000) and d=5 (your 500):

kent$  cat f
header
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15

kent$  awk -v i=1 -v t=10 -v d=5 'NR>1{a[NR-1]=$0}END{while(i<NR-1){for(k=i;k<i+t&&k<NR;k++)print a[k] > (i".txt"); close(i".txt");i=i+t-d}}' f

kent$  head *.txt
==> 1.txt <==
1
2
3
4
5
6
7
8
9
10

==> 6.txt <==
6
7
8
9
10
11
12
13
14
15

==> 11.txt <==
11
12
13
14
15

Upvotes: 3
