Reputation:
I'm trying to process 500,000 lines of text. My code below works but seems incredibly inefficient to me. I want to test this theory by doing the same thing with awk to see whether it saves any time. This code block is repeated throughout my script with different variables, so any time saved here would be multiplied about 10 times over by the end of the script. However, I'm really struggling to achieve this with awk.
Script:
_regex_ipv4_ip_='((25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])\.){3}(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])'
_regex_ipv4_cidr_='(([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\.){3}([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])(\/([0-9]|[1-2][0-9]|3[0-2]))'
grep -v '^#' ${tmp}/url_* | grep -Eho "${_regex_ipv4_ip_}|${_regex_ipv4_cidr_}" | sort | uniq >${tmp}/ipv4
How do I, using only awk:
- search multiple files
- exclude lines matching a pattern
- bring a bash variable into awk
- emulate 'grep -o' using a regular expression
- sort the results (I don't care whether they're sorted; sorting is only needed for 'uniq' in bash)
- emulate uniq
- write the results to a file
Input File(s) look like this
#Comment
http://192.168.0.1/whatever
#Comment
192.168.0.1
http://192.168.0.1/whatever/whatever
192.168.0.1
#Comment
192.168.0.0/16
192.168.0.0/16
#Comment
Output after duplicates removed...
192.168.0.1
192.168.0.0/16
Update: 1
Here is where I am at now...
This works exactly the way I want it to:
_regex_ipv4_ip_='192.168.0.1'
_regex_ipv4_cidr_='192.168.0.0/16'
awk -v exclude='#' -v include="${_regex_ipv4_ip_}" -v include2="${_regex_ipv4_cidr_}" '($0 !~ exclude) && match($0,include) && !seen[substr($0,RSTART,RLENGTH)]++ || match($0,include2) && !seen[substr($0,RSTART,RLENGTH)]++' /home/master/Desktop/t_*
However, I can't get the regular expressions stored in these variables into awk correctly:
_regex_ipv4_ip_='((25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])\.){3}(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])'
_regex_ipv4_cidr_='(([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\.){3}([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])(\/([0-9]|[1-2][0-9]|3[0-2]))'
Upvotes: 1
Views: 3100
Reputation: 203493
What you'll want is something like:
awk -v exclude='whatever' -v include='whatever' '
($0 !~ exclude) && match($0,include) && !seen[substr($0,RSTART,RLENGTH)]++
' file1 file2 ... fileN
but until you post sample input/output we can't fill in the details.
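With the sample input now in the question, the skeleton above could be filled in roughly as follows. This is only a sketch under a few assumptions, not a drop-in replacement: the temporary directory, the file names url_a and url_b, and the environment variable name IP_RE are made up for illustration, and the two regexes are merged into one pattern with an optional /prefix suffix. Two details matter here. First, values passed with -v go through escape-sequence processing, so a '\.' in the regex may not survive intact; reading the pattern from ENVIRON avoids that. Second, match() reports only the first match on a line, so emulating 'grep -o' needs a loop. The octet pattern is repeated four times instead of using {3} because some awk implementations do not support interval expressions.

```shell
#!/bin/sh
# Hypothetical scratch directory and sample files, for illustration only.
workdir=$(mktemp -d)

cat > "$workdir/url_a" <<'EOF'
#Comment
http://192.168.0.1/whatever
#Comment
192.168.0.1
http://192.168.0.1/whatever/whatever
192.168.0.1
#Comment
192.168.0.0/16
192.168.0.0/16
#Comment
EOF

cat > "$workdir/url_b" <<'EOF'
# another comment
10.0.0.0/8 and 10.1.2.3 on one line
192.168.0.1
EOF

# One octet (0-255); repeated explicitly to avoid {3} interval expressions.
octet='(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])'
# Dotted quad with an optional /0../32 suffix (longest alternatives first).
ip_re="${octet}\.${octet}\.${octet}\.${octet}(/(3[0-2]|[12][0-9]|[0-9]))?"

# Pass the regex through the environment so awk sees the backslashes unchanged.
IP_RE="$ip_re" awk '
BEGIN { re = ENVIRON["IP_RE"] }
/^#/  { next }                               # like: grep -v "^#"
{
    line = $0
    while (match(line, re)) {                # like: grep -o (every match on the line)
        hit = substr(line, RSTART, RLENGTH)
        if (!seen[hit]++)                    # like: sort | uniq, keeping first-seen order
            print hit
        line = substr(line, RSTART + RLENGTH)
    }
}
' "$workdir"/url_* > "$workdir/ipv4"

cat "$workdir/ipv4"
```

For the two sample files this should print 192.168.0.1, 192.168.0.0/16, 10.0.0.0/8 and 10.1.2.3, one per line, in the order they are first seen. No sort is needed because the seen[] array does the de-duplication by itself.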
Upvotes: 3