Reputation: 83
I am trying to write a regular expression that matches and extracts the version number from a configuration file. The version number appears in one of the following two numbering patterns:
1) <version>2.343</version>
2) <version>2.343.2</version>
The expected result is either:
1) 2.343
2) 2.343.2
My current solution is one of these two awk commands, each with a regex pattern that matches only one of the two cases. Is there a single solution that covers both cases?
awk 'match($0, /[0-9][.][0-9][0-9][0-9]/) {print substr($0, RSTART, RLENGTH) }' config.xml
awk 'match($0, /[0-9][.][0-9][0-9][0-9].[0-9]/) {print substr($0, RSTART, RLENGTH) }' config.xml
Upvotes: 7
Views: 424
Reputation: 29025
Here is another awk solution (tested with GNU and BSD awk) that tries to match exactly the two numbering patterns shown in the OP (<version>N.NNN</version> and <version>N.NNN.N</version>, where N is any digit). It assumes that <version>...</version> tags are properly balanced, do not appear in comments or strings, and do not span multiple lines. If several version numbers appear on the same line, they are all printed.
awk -F '</?version>' '{
for(i=1; i<=NF/2; i++)
if($(2*i) ~ /^[0-9]\.[0-9]{3}(\.[0-9])?$/) print $(2*i)
}' config.xml
If the components of version numbers can have any number of digits (minimum 1), just relax the regular expression: /^[0-9]+(\.[0-9]+){1,2}$/. And if there can be any number of components (minimum 1), relax a bit more: /^[0-9]+(\.[0-9]+)*$/ (or /^[0-9]+(\.[0-9]+)+$/ for at least 2 components).
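To illustrate the difference between the last two relaxed patterns, here is a minimal sketch. The sample file and its contents are made up for the demonstration and are not from the question; the variable names are arbitrary too.

```shell
# Hypothetical sample input; the contents are illustrative only.
tmp=$(mktemp)
printf '%s\n' \
  '<version>2.343</version>' \
  '<version>10.4.12</version>' \
  '<version>7</version>' \
  '<version>2.x</version>' \
  '<version>1.2.3.4</version>' > "$tmp"

# Any number of components (minimum 1): "7" is accepted.
any_components=$(awk -F '</?version>' '{
  for (i = 1; i <= NF/2; i++)
    if ($(2*i) ~ /^[0-9]+(\.[0-9]+)*$/) print $(2*i)
}' "$tmp")

# At least 2 components: "7" is rejected.
two_or_more=$(awk -F '</?version>' '{
  for (i = 1; i <= NF/2; i++)
    if ($(2*i) ~ /^[0-9]+(\.[0-9]+)+$/) print $(2*i)
}' "$tmp")

printf 'any number of components:\n%s\n' "$any_components"
printf 'at least two components:\n%s\n' "$two_or_more"
rm -f "$tmp"
```

In both runs the non-numeric entry 2.x is skipped. The sketch deliberately avoids the {1,2} interval form, since some older awk builds do not support interval expressions.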
If <version>...</version> tags are not properly balanced, can appear in comments, or can span several lines, a real XML parser would be a much better solution than a general-purpose utility like awk.
Upvotes: 4
Reputation: 2805
absolutely no need to invoke match() or resort to vendor-proprietary solutions
nawk ++NF OFS='' FS='(^[^>]*)?[<][/]?version[>]($)?'
2.343
2.343.2
the brute-force approaches :
gawk NF=NF OFS= FS='^[^>]+>|<[/].+$'   # kinda brute-force
mawk NF++ OFS= FS='^[^>]+.|./.+$'      # REALLY brute-force
2.343
2.343.2
Upvotes: 3
Reputation: 36390
Your two commands can be merged into one using ?, which means zero or one repetition, as follows
awk 'match($0, /[0-9][.][0-9][0-9][0-9]([.][0-9])?/) {print substr($0, RSTART, RLENGTH) }' config.xml
which, for a config.xml with the following content
1) <version>2.343</version>
2) <version>2.343.2</version>
gives output
2.343
2.343.2
(tested in gawk 4.2.1)
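One caveat with an optional group like this: an unescaped . matches any character, while [.] matches only a literal dot. The following sketch compares (.[0-9])? with ([.][0-9])? on a made-up input line (the value 2.343x7 is hypothetical and chosen only to expose the difference).

```shell
# Hypothetical input: "2.343x7" is not a three-part version number.
line='<version>2.343x7</version>'

# Unescaped "." matches any character, so "x7" is swallowed by the group.
loose=$(printf '%s\n' "$line" |
  awk 'match($0, /[0-9][.][0-9][0-9][0-9](.[0-9])?/) { print substr($0, RSTART, RLENGTH) }')

# "[.]" matches only a literal dot, so the match stops after "2.343".
strict=$(printf '%s\n' "$line" |
  awk 'match($0, /[0-9][.][0-9][0-9][0-9]([.][0-9])?/) { print substr($0, RSTART, RLENGTH) }')

printf 'loose=%s strict=%s\n' "$loose" "$strict"
# prints: loose=2.343x7 strict=2.343
```

On the two sample lines from the question, both forms give the same output, so the difference only shows up on input that deviates from those patterns.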
Upvotes: 3
Reputation: 133458
1st solution: With your shown samples, please try the following. It uses the match function of awk and should work in any POSIX awk version. The regex >[0-9]+(\.[0-9]+)*< matches a > followed by the version number followed by a <; if a match is found, the matched substring is printed without the surrounding > and <.
awk 'match($0,/>[0-9]+(\.[0-9]+)*</){print substr($0,RSTART+1,RLENGTH-2)}' Input_file
OR, in case you want to look specifically for the version tag, try the following:
awk 'match($0,/<version>[0-9]+(\.[0-9]+)*<\/version>/){print substr($0,RSTART+9,RLENGTH-19)}' Input_file
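The offsets 9 and 19 in the command above come from the tag lengths: <version> is 9 characters and </version> is 10, so 19 characters are stripped in total. As a hedged variation, the sketch below computes the same offsets with length() instead of hard-coding them. The variable names otag and ctag are my own and not part of the original answer.

```shell
# Sample line taken from the question.
line='1) <version>2.343.2</version>'

version=$(printf '%s\n' "$line" | awk '
  BEGIN { otag = "<version>"; ctag = "</version>" }
  # Build the regex from the tags so the offsets always agree with them.
  match($0, otag "[0-9]+([.][0-9]+)*" ctag) {
    print substr($0, RSTART + length(otag), RLENGTH - length(otag) - length(ctag))
  }')

printf '%s\n' "$version"
# prints: 2.343.2
```

This keeps the pattern and the arithmetic in sync if the tag name ever changes, at the cost of a slightly longer command.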
2nd solution: With your shown samples. This uses GNU awk's RS variable with the same regex; the text matched by RS is available in RT, and the value is extracted from it.
awk -v RS='<version>[0-9]+(\\.[0-9]+)*<\\/version>' 'RT{split(RT,arr,"[><]");print arr[3]}' Input_file
Upvotes: 9
Reputation: 37404
Using GNU awk and the third argument of match():
$ gawk 'match($0,/<version>(.*)<\/version>/,a){print a[1]}' file
2.343
2.343.2
Upvotes: 5
Reputation: 785068
You may use:
awk 'match($2, /[0-9]+(\.[0-9]+)+/) {
print $1, substr($2, RSTART, RLENGTH)}' file
1) 2.343
2) 2.343.2
Upvotes: 5