Bert

Reputation: 83

Regular expression to match and extract two version number patterns

I am trying to write a regular expression that matches and extracts the version number from a configuration file. The version number follows one of these two numbering patterns:

1) <version>2.343</version>
2) <version>2.343.2</version>

so that the result returned is either

1) 2.343
2) 2.343.2

My current solution is one of the two awk commands below, each with a regex pattern that matches one case individually. But there must be a single solution that covers both cases?

awk 'match($0, /[0-9][.][0-9][0-9][0-9]/) {print substr($0, RSTART, RLENGTH) }' config.xml
awk 'match($0, /[0-9][.][0-9][0-9][0-9].[0-9]/) {print substr($0, RSTART, RLENGTH) }' config.xml

Upvotes: 7

Views: 424

Answers (6)

Renaud Pacalet

Reputation: 29025

Here is another awk solution (tested with GNU and BSD awk) that tries to match exactly the two numbering patterns shown in the OP (<version>N.NNN</version> and <version>N.NNN.N</version>, where N is any digit). It assumes that <version>...</version> tags are properly balanced, do not appear in comments or strings, and do not span multiple lines. If several version numbers appear on the same line, they are all printed.

awk -F '</?version>' '{
  for(i=1; i<=NF/2; i++)
    if($(2*i) ~ /^[0-9]\.[0-9]{3}(\.[0-9])?$/) print $(2*i)
}' config.xml

If the components of version numbers can have any number of digits (minimum 1) just relax the regular expression: /^[0-9]+(\.[0-9]+){1,2}$/. And if there can be any number of components (minimum 1) relax a bit more: /^[0-9]+(\.[0-9]+)*$/ (or /^[0-9]+(\.[0-9]+)+$/ for at least 2 components).
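As a rough check of the "at least 2 components" variant, the sketch below runs the same field-splitting approach over a few made-up sample lines (they are not from the OP's file). The function name extract_versions is only illustrative.

```shell
# Hypothetical helper: print every <version>...</version> value read from
# stdin that consists of at least two dot-separated numeric components.
extract_versions() {
  awk -F '</?version>' '{
    # With the tags as field separators, tag contents land in even fields.
    for (i = 1; i <= NF/2; i++)
      if ($(2*i) ~ /^[0-9]+(\.[0-9]+)+$/) print $(2*i)
  }'
}

# Made-up input: "beta" and the single-component "7" should be skipped.
printf '%s\n' \
  '<version>2.343</version>' \
  '<version>12.5.10</version>' \
  '<version>beta</version>' \
  '<version>7</version>' | extract_versions
```

With this input, only 2.343 and 12.5.10 should be printed.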

If <version>...</version> tags are not properly balanced, can appear in comments, or can span over several lines, a real XML parser would be a much better solution than a general purpose utility like awk.

Upvotes: 4

RARE Kpop Manifesto

Reputation: 2805

absolutely no need to invoke match() or resort to vendor-proprietary solutions

nawk ++NF OFS='' FS='(^[^>]*)?[<][/]?version[>]($)?'
2.343
2.343.2

the brute-force approaches :

gawk NF=NF OFS= FS='^[^>]+>|<[/].+$'  # kinda brute-force

mawk NF++ OFS= FS='^[^>]+.|./.+$'     # REALLY brute-force
2.343
2.343.2

Upvotes: 3

Daweo

Reputation: 36390

Your two commands might be melded into one using ? meaning zero-or-one repetitions as follows

awk 'match($0, /[0-9][.][0-9][0-9][0-9]([.][0-9])?/) {print substr($0, RSTART, RLENGTH) }' config.xml

which for config.xml content as follows

1) <version>2.343</version>
2) <version>2.343.2</version>

gives output

2.343
2.343.2

(tested in gawk 4.2.1)
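One detail worth checking in patterns of this kind: outside a bracket expression, . matches any character, so the bare . in the question's second command would also accept a non-dot separator. A small sketch of the difference, using a made-up input string and the illustrative function names loose and strict:

```shell
# Hypothetical helpers: report whether a whole line matches each pattern.
# loose uses a bare "." (any character); strict uses "[.]" (a literal dot).
loose()  { awk '{ print (($0 ~ /^[0-9].[0-9]+$/)   ? "match" : "no match") }'; }
strict() { awk '{ print (($0 ~ /^[0-9][.][0-9]+$/) ? "match" : "no match") }'; }

echo '2x343' | loose    # "2x343" is not a version number, yet this matches
echo '2x343' | strict   # the literal-dot pattern rejects it
echo '2.343' | strict   # a real version number is still accepted
```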

Upvotes: 3

RavinderSingh13

Reputation: 133458

1st solution: With your shown samples, please try the following. It uses awk's match function and should work in any POSIX awk version. The regex >[0-9]+(\.[0-9]+)*< matches a version number that is preceded by > and followed by <; if a match is found, the matched substring is printed without the surrounding > and <.

awk 'match($0,/>[0-9]+(\.[0-9]+)*</){print substr($0,RSTART+1,RLENGTH-2)}' Input_file

OR, in case you want to match the version tag exactly, try the following:

awk 'match($0,/<version>[0-9]+(\.[0-9]+)*<\/version>/){print substr($0,RSTART+9,RLENGTH-19)}'  Input_file
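The offsets 9 and 19 in the command above are the length of <version> and the combined length of <version> and </version>. A minimal sketch that derives them from the tag name instead of hard-coding them; the function name tag_value is hypothetical, and the tag name is assumed to contain no regex metacharacters:

```shell
# Hypothetical helper: print the numeric content of <TAG>...</TAG> on each
# line of stdin, computing the substr() offsets from the tag lengths.
tag_value() {
  awk -v tag="$1" '
    BEGIN {
      otag = "<" tag ">"                       # opening tag, e.g. <version>
      ctag = "</" tag ">"                      # closing tag, e.g. </version>
      re = otag "[0-9]+([.][0-9]+)*" ctag      # dynamic regex built from the tags
    }
    match($0, re) {
      # Skip the opening tag and drop both tags from the matched length.
      print substr($0, RSTART + length(otag), RLENGTH - length(otag) - length(ctag))
    }'
}

printf '%s\n' '1) <version>2.343</version>' '2) <version>2.343.2</version>' | tag_value version
```

For the sample lines from the question this should print 2.343 and 2.343.2.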
2nd solution: With your shown samples, using GNU awk. The same regex is set as the record separator RS, and the value is then extracted from the matched separator text in RT.

awk -v RS='<version>[0-9]+(\\.[0-9]+)*<\\/version>' 'RT{split(RT,arr,"[><]");print arr[3]}' Input_file

Upvotes: 9

James Brown

Reputation: 37404

Using GNU awk and the third argument of match():

$ gawk 'match($0,/<version>(.*)<\/version>/,a){print a[1]}' file
2.343
2.343.2

Upvotes: 5

anubhava

Reputation: 785068

You may use:

awk 'match($2, /[0-9]+(\.[0-9]+)+/) {
   print $1, substr($2, RSTART, RLENGTH)}' file

1) 2.343
2) 2.343.2

Upvotes: 5
