user5309822

awk: exclude pattern, match only expression, sort, uniq

I'm trying to process 500,000 lines of text. My code below works but seems incredibly inefficient to me. I want to test that by doing the same job with awk alone and seeing whether it saves any time. This code block is repeated throughout my script with different variables, so any time saved here would be multiplied roughly tenfold by the end of the script. However, I'm really struggling to achieve this with awk.

Script:

_regex_ipv4_ip_='((25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])\.){3}(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])'

_regex_ipv4_cidr_='(([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\.){3}([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])(\/([0-9]|[1-2][0-9]|3[0-2]))'

grep -v '^#' ${tmp}/url_* | grep -Eho "${_regex_ipv4_ip_}|${_regex_ipv4_cidr_}" | sort | uniq >${tmp}/ipv4

How do I, with only awk:

- Search multiple files.
- Exclude lines matching a pattern.
- Bring a bash variable into awk.
- Emulate 'grep -o' using a regular expression.
- Sort the results (I don't care whether they're sorted; sorting is only needed for 'uniq' in bash).
- Emulate uniq.
- Write the results to a file.

Input File(s) look like this

#Comment
http://192.168.0.1/whatever #Comment
192.168.0.1
http://192.168.0.1/whatever/whatever
192.168.0.1 #Comment
192.168.0.0/16
192.168.0.0/16 #Comment

Output after duplicates removed...

192.168.0.1
192.168.0.0/16

Update: 1

Here is where I am at now...

This works exactly the way I want it to:

_regex_ipv4_ip_='192.168.0.1'
_regex_ipv4_cidr_='192.168.0.0/16'

awk -v exclude='#' -v include="${_regex_ipv4_ip_}" -v include2="${_regex_ipv4_cidr_}" '($0 !~ exclude) && match($0,include) && !seen[substr($0,RSTART,RLENGTH)]++ || match($0,include2) && !seen[substr($0,RSTART,RLENGTH)]++' /home/master/Desktop/t_*

However, I can't get the regular expressions stored in my shell variables into awk correctly:

_regex_ipv4_ip_='((25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])\.){3}(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])'
_regex_ipv4_cidr_='(([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\.){3}([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])(\/([0-9]|[1-2][0-9]|3[0-2]))'
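
One possible cause, offered as an assumption rather than a confirmed diagnosis: a value assigned with -v goes through awk's string escape processing, so a backslash sequence such as \. may not reach the regex engine unchanged (the exact result differs between awk implementations). Values read from ENVIRON are not processed that way. The small sketch below compares the two routes; the variable name re and the test string 1x2 are made up for illustration.

```shell
#!/bin/sh
# Illustrative only: "re" and the input "1x2" are hypothetical test values.
re='1\.2'

# Route 1: -v. The value is escape-processed first, so the backslash may be
# dropped (gawk prints a warning) and the dot then matches any character.
# The result here depends on the awk implementation in use.
v_result=$(printf '1x2\n' |
    awk -v re="$re" '{ print (($0 ~ re) ? "match" : "no match") }' 2>/dev/null)

# Route 2: the environment. ENVIRON hands the value over untouched, so \.
# still means a literal dot and "1x2" is not matched.
env_result=$(printf '1x2\n' |
    re="$re" awk '{ print (($0 ~ ENVIRON["re"]) ? "match" : "no match") }')

echo "-v:      $v_result"
echo "ENVIRON: $env_result"
```

Separately, the {3} interval in the patterns may not be supported by every awk (older gawk releases needed --re-interval, and some mawk versions lack intervals), so that is worth checking as well.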

Upvotes: 1

Views: 3100

Answers (1)

Ed Morton

Reputation: 203493

What you'll want is something like:

awk -v exclude='whatever' -v include='whatever' '
($0 !~ exclude) && match($0,include) && !seen[substr($0,RSTART,RLENGTH)]++
' file1 file2 ... fileN

but until you post sample input/output we can't fill in the details.
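
For the sample input shown in the question, one way the details might be filled in is sketched here. It rests on several assumptions that are not from the original post: the temporary directory and the file name url_1 are invented, the two patterns are merged into one address pattern with an optional /mask suffix, the {3} interval is written out by hand in case the local awk lacks interval support, and the regex is passed through the environment (ENVIRON) instead of -v so its backslashes arrive intact. The while loop stands in for grep -o, and the seen[] array stands in for sort | uniq.

```shell
#!/bin/sh
# Sketch only: directory, file name and the merged regex are assumptions.
tmp=$(mktemp -d)
cat > "$tmp/url_1" <<'EOF'
#Comment
http://192.168.0.1/whatever #Comment
192.168.0.1
http://192.168.0.1/whatever/whatever
192.168.0.1 #Comment
192.168.0.0/16
192.168.0.0/16 #Comment
EOF

# One octet, 0-255.
octet='(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])'
# Four octets, optionally followed by a /0 to /32 mask.
re="${octet}\.${octet}\.${octet}\.${octet}(/(3[0-2]|[12]?[0-9]))?"

re="$re" awk '
BEGIN { re = ENVIRON["re"] }      # read the regex without escape processing
/^#/  { next }                    # skip comment lines, like grep -v "^#"
{
    line = $0
    while (match(line, re)) {     # every match on the line, like grep -o
        hit = substr(line, RSTART, RLENGTH)
        if (!seen[hit]++)         # print each distinct match once, no sort needed
            print hit
        line = substr(line, RSTART + RLENGTH)
    }
}' "$tmp"/url_* > "$tmp/ipv4"

cat "$tmp/ipv4"
```

The output keeps first-seen order instead of sorted order, which the question says is acceptable. If sorted output is wanted, the result can still be piped through sort afterwards.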

Upvotes: 3
