Reputation: 41
I have written a script that gives me the line after a pattern, plus that line's number (the line I need sits between a line of 46 '=' characters above it and another one below it). After that I run a sed that formats the output so I am left with only the line between the two 46*'=' markers, and I write that to a file so I can work with it further.
The file I get out of this is really small, with at most 30 matches.
I started with this:
awk '/^\={46}$/{ n=NR+1 } n>=NR {print NR","$0}' $file1 | sed -n '2~4p' > tmpfile$1
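Spelled out with comments, that pipeline works like this on input shaped like the example below (GNU sed assumed, since 2~4 is a GNU first~step address):
awk '
  /^\={46}$/ { n = NR + 1 }     # a line of exactly 46 "=": remember the next line number
  n >= NR    { print NR","$0 }  # true on the "=" line itself and again on the line right after it
' "$file1" |
  sed -n '2~4p' > "tmpfile$1"
# each marker pair emits 4 lines (opening "=", wanted line, closing "=", line after it);
# sed keeps every 4th line starting at the 2nd, i.e. only the wanted lines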
but on a 4 GB file it needed 115 seconds, on a 1 GB file 12 seconds, and on a 100 MB file 2 seconds.
I noticed that the last match was always the same string across all files, and unique within each file, so I added an exit. The last match occurs somewhere around line 50k-500k, and after it there are another 67 million lines in the 4 GB file (last match at line 71k), 26 million lines in the 1 GB file (last match at line 168k), and 2 million lines in the 100 MB file (last match at line 414k).
awk '/^\={46}$/{ n=NR+1 } n>=NR {print NR","$0} /unique string here/{exit}' $file1 | sed -n '2~4p' > tmpfile$1
The times I got were: on a 4 GB file it needed 70 seconds, on a 1 GB file 2 seconds, and on a 100 MB file 1 second, which is an improvement.
I also tried a different order:
awk '1;/unique string here/{exit}' $file1 | awk '/^\={46}$/{ n=NR+1 } n>=NR {print NR","$0}' | sed -n '2~4p' > tmpfile$1
and got: on a 4 GB file it needed 70 seconds, on a 1 GB file 5 seconds, and on a 100 MB file 1 second.
Now, while having an exit inside the awk was an improvement, considering where the last match occurred I was expecting better performance for the 4 GB file, at least judging by how much time I saved with the 1 GB file.
Since the 3rd awk was slower than the 2nd awk for the 1 GB file but had the same time for the 4 GB file, I guess I am running into some memory issue, because the 4 GB file is too big and I am just using an Ubuntu VM with 2 CPUs and 4 GB of RAM.
This is my first time using awk, sed and scripting overall, so I don't know what to do now to get an even better time for the 4 GB file. I am OK with the 2 seconds for the 1 GB file.
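One sanity check I have not done yet (just a sketch of the idea): time a plain sequential read of the same file, to see how much of the 70 seconds is raw disk I/O rather than awk itself:
time cat "$file1" > /dev/null   # raw read speed of the whole file, no processing at all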
Example of Input/Output
Random text here
blab
==============================================
Here is the string I need
==============================================
------------------------
random stuff
------------------------
other stuff
==============================================
Here is the 2nd string I need
==============================================
i dont need this string here
Random stuff
==============================================
last string I need, that is the same across all files
==============================================
a lot of lines are following the last match
Output:
5,Here is the string I need
15,Here is the 2nd string I need
22,last string I need, that is the same across all files
edit1: Will update and try something new (spinning up a similar VM with more RAM) on Monday.
edit2: After spinning up a new VM and doing more testing with even bigger files (around 15 GB), taking caching out as a factor, I haven't noticed any big changes in runtime with any of the code posted here.
But the flag-on/flag-off {f=!f; next} approach is truly a lot more elegant than my code, so thanks for that, James Brown and Ed Morton. If I could, I would have picked both your answers :)
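For anyone repeating the timings: one common way to take caching out of the picture between runs on Linux is to drop the page cache (needs root; just one way of doing it):
sync && echo 3 | sudo tee /proc/sys/vm/drop_caches   # flush dirty pages, then drop page cache, dentries and inodes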
Upvotes: 3
Views: 649
Reputation: 203229
You never need sed when you're using awk. You don't need to escape = as it's not a metacharacter. String concatenation is slow. Regexp comparisons are slower than string comparisons. It doesn't make sense to test for n>=NR since n is only greater than NR for the ==* line you don't want. You're currently printing the line after every == line but you only want the lines between pairs of them. If your "unique string" is one of the lines you want to print then just test it where you're printing it instead of for every line in the file. Try:
$ awk -v OFS=',' '
$0=="=============================================="{f=!f; next}
f {print NR, $0; if (/unique string/) exit}
' file
5,Here is the string I need
15,Here is the 2nd string I need
22,last string I need, that is the same across all files
and to see what difference the regexp comparison makes you can try this too:
awk -v OFS=',' '
/^={46}$/{f=!f; next}
f {print NR, $0; if (/unique string/) exit}
' file
and even just not forcing awk to count the 46 =s would probably be faster:
awk -v OFS=',' '
/^=+$/{f=!f; next}
f {print NR, $0; if (/unique string/) exit}
' file
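To see what the variants actually cost on your files, each one can be timed with the output thrown away, e.g. (a sketch using the shell's time keyword):
time awk -v OFS=',' '
  /^=+$/{f=!f; next}
  f {print NR, $0; if (/unique string/) exit}
' file > /dev/null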
Upvotes: 4
Reputation: 37394
How about this:
$ awk '/^\={46}$/ {f=!f; next} f {print NR, $0}' file
5 Here is the string I need
15 Here is the 2nd string I need
22 last string I need, that is the same across all files
A string of =s flips up the flag f, prints after it until the next flippin' string that flips the flag down.
Upvotes: 3