Reputation:
In (Visual Basic, .NET):
Dim result As Match = Regex.Match(aStr, aMatchStr)
If result.Success Then
    Dim result0 As String = result.Groups(0).Value
    Dim result1 As String = result.Groups(1).Value
End If
With aStr equal to the following (the whitespace is ordinary spaces, and there are seven spaces between "n" and "("):
"AMEVDIEERPK + 7 Oxidation (M)"
Why does result1 become an empty string when aMatchStr is
"\s*(\d*).*?Oxidation\s+\(M\)"
but "7" when aMatchStr is
"\s*(\d*)\s*Oxidation\s+\(M\)"?
(result0 becomes "AMEVDIEERPK + 7 Oxidation (M)".)
(This is from MSQuant, MascotResultParser.vb, function modificationParseMatch().)
Upvotes: 0
Views: 569
Reputation:
I am sorry, there is more to the syntax...
The plus sign cannot be relied on. It separates the (peptide) sequence from the (peptide) modifications, and there can be more than one modification for each sequence. Sample with two modifications (there are seven spaces between "2" and "L"):
"KLIDLTQFPAFVTPMGK + Oxidation (M); 2 Lysine-13C615N2 (K-full)"
The user could specify "\S+\s+\(K-full\)" for the second modification, and "2" should then be extracted.
Here are some more sample lines (after the plus sign):
" Phospho (ST); 2 Dimethyl (K); Dimethyl (N-term)"
" Phospho (ST); 2 Dimethyl:2H(4) (K); Dimethyl:2H(4) (N-term)"
" N-Acetyl (Protein)"
" 2 Dimethyl:2H(4) (K); Dimethyl:2H(4) (N-term)"
" N-Acetyl (Protein); 2 Lysine-13C615N2 (K-full)"
" Oxidation (M); N-Acetyl (Protein)"
" Oxidation (M); N-Acetyl (Protein); Lysine-13C615N2 (K-full)"
" N-Acetyl (Protein); Lysine-13C615N2 (K-full)"
" Oxidation (M); Lysine-13C615N2 (K-full)"
" Oxidation (M)"
" 2 Oxidation (M); Lysine-13C615N2 (K-full)"
A sample file with user-defined rules can be found at (packed in 7-zip format):
<http://www.pil.sdu.dk/1/MSQuant/CEBIquantModes,2008-11-10.7z>
Upvotes: 1
Reputation: 321578
I think it's because the matching starts at the first character and moves on from there...
For your first regular expression:
Does "AMEVDIEERPK + 7 Oxidation (M)" match "\s*(\d*).*?Oxidation\s+\(M\)"? Yes, so matching stops here.
For your second regular expression:
Does "AMEVDIEERPK + 7 Oxidation (M)" match "\s*(\d*)\s*Oxidation\s+\(M\)"? No...
Does "MEVDIEERPK + 7 Oxidation (M)" match "\s*(\d*)\s*Oxidation\s+\(M\)"? No...
Does "EVDIEERPK + 7 Oxidation (M)" match "\s*(\d*)\s*Oxidation\s+\(M\)"? No...
...
Does " 7 Oxidation (M)" match "\s*(\d*)\s*Oxidation\s+\(M\)"? Yes
If you had used \d+ instead of \d* in the first regular expression, you would have got a better result.
This is not exactly how regular expressions work, but you get the idea.
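The walk-through above can be checked directly. The sketch below uses Python's re module rather than VB.NET (an assumption made for brevity; for these particular patterns both engines try start positions from left to right in the same way), and it writes the sample with single spaces, as it appears in this thread. It prints where each match starts and what group 1 captured.

```python
import re

# Sample string from the question (single spaces, as shown in this thread).
text = "AMEVDIEERPK + 7 Oxidation (M)"

lazy = re.search(r"\s*(\d*).*?Oxidation\s+\(M\)", text)
strict = re.search(r"\s*(\d*)\s*Oxidation\s+\(M\)", text)

# The lazy pattern already succeeds at index 0: \s* and \d* match nothing,
# and .*? grows until "Oxidation" is reached, swallowing the "7" on the way.
print(lazy.start(), repr(lazy.group(1)))

# The stricter pattern fails at every start position before the space in front of "7".
print(strict.start(), repr(strict.group(1)))
```

The first pattern reports a match starting at index 0 with an empty group 1, while the second only succeeds further into the string, where the digits are the first thing \d* sees.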
Upvotes: 3
Reputation: 118865
".*?" is lazy: it first tries to match zero characters and grows only when the rest of the pattern would otherwise fail. "\d*" may likewise match zero digits. Because the match is attempted from the first character of the string, "\d*" matches nothing at the 'A', and ".*?" then expands just far enough to reach "Oxidation", taking the "7" with it.
Reference: Quantifiers in Regular Expressions (MSDN)
Upvotes: 1
Reputation:
I settled on using \w* for now. The user will be required to specify matching for any whitespace, but it covers the majority of cases for this particular application and how it is commonly used.
So for the example the regular expression is then:
\s*(\d*)\s*\w*Oxidation\s+\(M\)
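To illustrate what the \w* prefix buys, here is a small sketch in Python (not the VB.NET used by MSQuant; the helper name count_for and the partial rule are hypothetical, chosen only for this example). It prepends the prefix to a user rule that does not start at the beginning of the modification name and runs it against sample lines from this thread.

```python
import re

# Prefix from the post above, now including \w*.
PREFIX = r"\s*(\d*)\s*\w*"

def count_for(rule, line):
    """Return the captured count ('' if absent), or None if the rule does not match."""
    m = re.search(PREFIX + rule, line)
    return m.group(1) if m else None

# Hypothetical user rule that omits the leading "Oxi".
partial_rule = r"dation\s+\(M\)"

print(repr(count_for(partial_rule, " 2 Oxidation (M); Lysine-13C615N2 (K-full)")))
print(repr(count_for(partial_rule, " Oxidation (M)")))
print(repr(count_for(partial_rule, " N-Acetyl (Protein)")))
```

Because \w* can absorb the start of the modification name, the partial rule still finds the count when one is present, returns an empty string when the count is omitted, and does not match unrelated modifications.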
Upvotes: 1
Reputation: 179779
With the syntax update, it seems we don't need to worry about the difference between \d+ and \d*. There's always a + sign present, even if there are no digits. Matching this + constrains the regex to the point that it works as expected:
"\s* // whitespace before +
\+ // The + sign itself
\s* // whitespace after +
(\d*) // optional digits
.*? // anything between the last digit and Oxidation (M), as few characters as possible
Oxidation\s+\(M\)"
Since the + must be matched first, and must be matched precisely once, the AMEVDIEERPK prefix cannot be matched by .*?.
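The commented layout above is not valid regex syntax as written, but most engines offer a verbose mode that permits it. A minimal sketch in Python (used here instead of VB.NET as an assumption for brevity; re.VERBOSE ignores whitespace in the pattern and treats # as a comment marker), tried against the two sample strings from this thread:

```python
import re

# The same pattern, laid out with re.VERBOSE so that the comments are legal.
PATTERN = re.compile(r"""
    \s*                 # whitespace before +
    \+                  # the + sign itself
    \s*                 # whitespace after +
    (\d*)               # optional digits (the count)
    .*?                 # as little as possible before the modification name
    Oxidation\s+\(M\)
""", re.VERBOSE)

with_count = PATTERN.search("AMEVDIEERPK + 7 Oxidation (M)")
without_count = PATTERN.search("AMEVDIEERPK + Oxidation (M)")

print(repr(with_count.group(1)))
print(repr(without_count.group(1)))
```

With the + sign as an anchor, group 1 holds the digits when they are present and is empty when the count is left out, and the match no longer starts inside the peptide sequence.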
Upvotes: 1
Reputation: 41132
To answer your second message, you (or your user) can specify \w*dation\s+\(M\) to match Oxidation (M), Gradation (M) or dation (M).
Upvotes: 1
Reputation:
Thanks for the quick responses!
The numbers in the input are left out if there is only one (peptide) modification instead of 7 as in the previous example, e.g.:
"AMEVDIEERPK + Oxidation (M)"
and there would be no match if "\d+" were used. But maybe I should use two regular expressions, one for each of these two cases. This would increase the complexity of the program somewhat (as I want to avoid memory garbage from constructing a regular expression for each string to be matched), but it is acceptable.
What I really wanted to do was to let the user specify a match rule without requiring the rule to match from the beginning of the (peptide) modification (that is why I tried to introduce the non-greedy match).
Right now the user's rule is prepended with "\s*(\d*)\s*", and the user must thus specify "Oxidation\s+\(M\)" to match. Specifying e.g. "dation\s+\(M\)" will not work.
Upvotes: 1
Reputation: 46754
\s*        Zero or more whitespace characters
(\d*)      Zero or more digits (captured)
.*?        Any characters (non-greedy, so only up to the next match)
Oxidation  Matches the word Oxidation
\s+\(M\)   One or more whitespace characters, then (M)
The problem here is that you are matching zero or more of any character before the word Oxidation, including any digits, which eats the digits that the preceding \d* might have matched.
\s*(\d*)\s*Oxidation\s+\(M\)
The difference here is that you allow only whitespace before Oxidation, so the digits are not eaten.
Change the \d* to \d+ to catch the numbers.
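A quick way to see the effect of \d+ is the sketch below, written in Python rather than VB.NET (an assumption for brevity). The second sample string, without a count, is taken from elsewhere in this thread.

```python
import re

# Lazy pattern from the question, with \d* changed to \d+.
pattern = re.compile(r"\s*(\d+).*?Oxidation\s+\(M\)")

with_count = pattern.search("AMEVDIEERPK + 7 Oxidation (M)")
without_count = pattern.search("AMEVDIEERPK + Oxidation (M)")

# Requiring at least one digit forces the match to start near the "7".
print(with_count.group(1) if with_count else None)

# Without a count there is no digit to anchor on, so nothing matches.
print(without_count.group(1) if without_count else None)
```

So \d+ captures the count when it is present, at the cost of not matching at all when the count is omitted.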
Upvotes: 4