Reputation: 477
I have a big file that looks like this (it actually has 12368 rows):
Header
175566717.000
175570730.000
175590376.000
175591966.000
175608932.000
175612924.000
175614836.000
.
.
.
175680016.000
175689679.000
175695803.000
175696330.000
What I want to do is delete the header, then extract the first 2000 lines (lines 1 to 2000), then lines 1501 to 3500, then 3001 to 5000, and so on. In other words: extract windows of 2000 lines, with an overlap of 500 lines between consecutive windows, until the end of the file.
From a previous post, I got this:
tail -n +2 myfile.txt | awk 'BEGIN{file=1} ++count && count==2000 {print > "window"file; file++; count=500} {print > "window"file}'
But that isn't what I want: I don't get the 500-line overlap, and my first window has 1999 rows instead of 2000.
Any help would be appreciated.
Upvotes: 1
Views: 104
Reputation: 212544
Reading the entire file into memory is usually not a great idea, and in this case is not necessary. Given a line number, you can easily compute which files it should go into. For example:
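tail -n +2 myfile.txt | awk -v t=2000 -v d=500 '{
    s = t - d                  # step between the starts of consecutive windows
    a = int(NR / s)            # highest window index that could contain line NR
    b = int((NR - t) / s)      # lowest candidate window index
    for (f = b; f <= a; f++) {
        # window f (0-based) covers lines f*s+1 through f*s+t
        if (f >= 0 && f * s < NR && NR <= f * s + t)
            print > ("window" (f + 1))
    }
}'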
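To see which windows a given line lands in, the same arithmetic can be sketched in Python. This is only an illustration: `windows_for_line` is a made-up helper name, line numbers are counted after the header has been removed, and the end of the file is not taken into account.

```python
def windows_for_line(n, t=2000, d=500):
    """Return the 1-based window numbers that contain data line n (1-based)."""
    step = t - d  # distance between the starts of consecutive windows
    # Window f (0-based) covers lines f*step + 1 through f*step + t.
    lowest = max(0, -((t - n) // step))  # ceil((n - t) / step), clamped at 0
    highest = (n - 1) // step            # last window that starts before line n
    return [f + 1 for f in range(lowest, highest + 1)]


# Lines inside an overlap region belong to two windows.
for n in (1, 1500, 1501, 2000, 2001, 3500):
    print(n, windows_for_line(n))
```

Lines 1501 to 2000 fall in both window 1 and window 2, which is the 500-line overlap asked for.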
Upvotes: 0
Reputation: 3756
awk is not ideal for this. In Python you could do something like:
with open("data") as fin:
    lines = fin.readlines()
# remove header
lines = lines[1:]
# window size and overlap from the question
window, overlap = 2000, 500
# print the lines
i = 0
while True:
    if len(lines) < i + window:
        # we're done. whatever is left in the file will be ignored
        break
    print("\n starting window")
    for line in lines[i:i + window]:
        print(line.rstrip("\n"))  # remove \n
    i += window - overlap
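If holding the whole file in memory is a concern, the same windows can be produced lazily. The following is a rough sketch, not a tested tool: `sliding_windows` is a hypothetical helper, it assumes the overlap is smaller than the window size, and like the loop above it drops a final window that is not full.

```python
from itertools import islice


def sliding_windows(lines, size=2000, overlap=500):
    """Yield lists of `size` items; consecutive lists share `overlap` items."""
    step = size - overlap  # must be positive, i.e. overlap < size
    it = iter(lines)
    window = list(islice(it, size))
    while len(window) == size:
        yield window
        # Keep the overlapping tail and read `step` fresh items.
        window = window[step:] + list(islice(it, step))


# Usage sketch (assumes a file named "data" with one header line):
# with open("data") as fin:
#     next(fin)  # skip the header
#     for n, w in enumerate(sliding_windows(fin), start=1):
#         with open(f"window{n}", "w") as out:
#             out.writelines(w)
```

Only one window's worth of lines is held at a time, so memory use does not grow with the size of the file.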
Upvotes: 0
Reputation: 195229
awk -v i=1 -v t=2000 -v d=500 'NR>1{a[NR-1]=$0}
END{while(i<NR-1){f=i".txt"; for(k=i;k<i+t&&k<NR;k++)print a[k] > f; close(f);i=i+t-d}}' file
Try the line above. You can change the numbers to fit your requirements, and you can define your own filenames too.
A little test with t=10 (your 2000) and d=5 (your 500):
kent$ cat f
header
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
kent$ awk -v i=1 -v t=10 -v d=5 'NR>1{a[NR-1]=$0}END{while(i<NR-1){f=i".txt"; for(k=i;k<i+t&&k<NR;k++)print a[k] > f; close(f);i=i+t-d}}' f
kent$ head *.txt
==> 1.txt <==
1
2
3
4
5
6
7
8
9
10
==> 6.txt <==
6
7
8
9
10
11
12
13
14
15
==> 11.txt <==
11
12
13
14
15
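The window boundaries in this test can be double-checked with a few lines of Python. This is just a sketch of the same loop arithmetic: `window_ranges` is a made-up name, `n` is the number of data lines after the header, and the last window is clipped at the end of the data.

```python
def window_ranges(n, t, d):
    """List the (first, last) 1-based line numbers of each window over n data lines."""
    ranges = []
    i = 1
    while i < n:
        ranges.append((i, min(i + t - 1, n)))  # clip the last window at line n
        i += t - d  # the next window starts t - d lines later
    return ranges


print(window_ranges(15, 10, 5))  # [(1, 10), (6, 15), (11, 15)]
```

The start of each range matches the output file names above (1.txt, 6.txt, 11.txt).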
Upvotes: 3