Reputation: 41
I have written a script that gives me the line after a pattern, plus that line's number (the line I need sits between a line of 46 '=' characters above it and another one below it). After that I run a sed that formats the output so I am left with only the line between the two 46*'=' markers, and I write that to a file so I can work with it further.
The file I get out of this is really small, with at most 30 matches.
I started with this:
awk '/^\={46}$/{ n=NR+1 } n>=NR {print NR","$0}' $file1 | sed -n '2~4p' > tmpfile$1
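Spelled out with comments, that pipeline works like this on input shaped like the example below (GNU sed assumed, since 2~4 is a GNU first~step address):
awk '
  /^\={46}$/ { n = NR + 1 }     # a line of exactly 46 "=": remember the next line number
  n >= NR    { print NR","$0 }  # true on the "=" line itself and again on the line right after it
' "$file1" |
  sed -n '2~4p' > "tmpfile$1"
# each marker pair emits 4 lines (opening "=", wanted line, closing "=", line after it);
# sed keeps every 4th line starting at the 2nd, i.e. only the wanted lines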
but on a 4 GB file it needed 115 seconds, on a 1 GB file 12 seconds, and on a 100 MB file 2 seconds.
I noticed that the last match was always the same string across all files, and unique within each file, so I added an exit. The last match occurs somewhere around line 50k-500k, and after it there are another 67 million lines in the 4 GB file (last match at line 71k), 26 million lines in the 1 GB file (last match at line 168k), and 2 million lines in the 100 MB file (last match at line 414k).
awk '/^\={46}$/{ n=NR+1 } n>=NR {print NR","$0} /unique string here/{exit}' $file1 | sed -n '2~4p' > tmpfile$1
The times I got were: on a 4 GB file it needed 70 seconds, on a 1 GB file 2 seconds, and on a 100 MB file 1 second, which is an improvement.
I also tried a different order:
awk '1;/unique string here/{exit}' $file1 | awk '/^\={46}$/{ n=NR+1 } n>=NR {print NR","$0}' | sed -n '2~4p' > tmpfile$1
and got: on a 4 GB file it needed 70 seconds, on a 1 GB file 5 seconds, and on a 100 MB file 1 second.
Now, while having an exit inside the awk was an improvement, considering where the last match occurred I was expecting better performance for the 4 GB file, at least judging by how much time I saved with the 1 GB file.
Since the 3rd awk was slower than the 2nd awk for the 1 GB file but had the same time for the 4 GB file, I guess I am running into some memory issue, because the 4 GB file is too big and I am just using an Ubuntu VM with 2 CPUs and 4 GB of RAM.
This is my first time using awk, sed and scripting overall, so I don't know what to do now to get an even better time for the 4 GB file. I am OK with the 2 seconds for the 1 GB file.
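One sanity check I have not done yet (just a sketch of the idea): time a plain sequential read of the same file, to see how much of the 70 seconds is raw disk I/O rather than awk itself:
time cat "$file1" > /dev/null   # raw read speed of the whole file, no processing at all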
Example of Input/Output
Random text here
blab
==============================================
Here is the string I need
==============================================
------------------------
random stuff
------------------------
other stuff
==============================================
Here is the 2nd string I need
==============================================
i dont need this string here
Random stuff
==============================================
last string I need, that is the same across all files
==============================================
a lot of lines are following the last match
Output:
5,Here is the string I need
15,Here is the 2nd string I need
22,last string I need, that is the same across all files
edit1: Will update and try something new (spinning up a similar VM with more RAM) on Monday.
edit2: After spinning up a new VM and doing more testing with even bigger files (around 15 GB), taking caching out as a factor, I haven't noticed any big changes in runtime with any of the code posted here.
But the flag-on/flag-off {f=!f; next} approach is truly a lot more elegant than my code, so thanks for that, James Brown and Ed Morton. If I could, I would have picked both your answers :)
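For anyone repeating the timings: one common way to take caching out of the picture between runs on Linux is to drop the page cache (needs root; just one way of doing it):
sync && echo 3 | sudo tee /proc/sys/vm/drop_caches   # flush dirty pages, then drop page cache, dentries and inodes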
Upvotes: 3
Views: 649
Reputation: 203229
You never need sed when you're using awk. You don't need to escape = as it's not a metacharacter. String concatenation is slow. Regexp comparisons are slower than string comparisons. It doesn't make sense to test for n>=NR since n is only greater than NR for the ==* line you don't want. You're currently printing the line after every == line but you only want the lines between pairs of them. If your "unique string" is one of the lines you want to print then just test it where you're printing it instead of for every line in the file. Try:
$ awk -v OFS=',' '
$0=="=============================================="{f=!f; next}
f {print NR, $0; if (/unique string/) exit}
' file
5,Here is the string I need
15,Here is the 2nd string I need
22,last string I need, that is the same across all files
and to see what difference the regexp comparison makes you can try this too:
awk -v OFS=',' '
/^={46}$/{f=!f; next}
f {print NR, $0; if (/unique string/) exit}
' file
and even just not forcing awk to count the 46 =s would probably be faster:
awk -v OFS=',' '
/^=+$/{f=!f; next}
f {print NR, $0; if (/unique string/) exit}
' file
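To see what the variants actually cost on your files, each one can be timed with the output thrown away, e.g. (a sketch using the shell's time keyword):
time awk -v OFS=',' '
  /^=+$/{f=!f; next}
  f {print NR, $0; if (/unique string/) exit}
' file > /dev/null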
Upvotes: 4
Reputation: 37394
How about this:
$ awk '/^\={46}$/ {f=!f; next} f {print NR, $0}' file
5 Here is the string I need
15 Here is the 2nd string I need
22 last string I need, that is the same across all files
A string of =s flips up the flag f, prints after it until the next flippin' string that flips the flag down.
Upvotes: 3