Reputation: 759
I have a problem finding all occurrences of a pattern in a string.
Consider this string:
string msg= "=?windows-1258?B?UkU6IFRyIDogUGxhbiBkZSBjb250aW51aXTpIGQnYWN0aXZpdOkgZGVz?= =?windows-1258?B?IHNlcnZldXJzIFdlYiBHb1ZveWFnZXN=?=";
I want to return the 2 occurrences (in order to later decode them):
=?windows-1258?B?UkU6IFRyIDogUGxhbiBkZSBjb250aW51aXTpIGQnYWN0aXZpdOkgZGVz?=
and
=?windows-1258?B?IHNlcnZldXJzIFdlYiBHb1ZveWFnZXN=?=
With the following regex code, it returns only 1 occurrence: the full string.
var charSetOccurences = new Regex(@"=\?.*\?B\?.*\?=", RegexOptions.IgnoreCase);
var charSetMatches = charSetOccurences.Matches(input);
foreach (Match match in charSetMatches)
{
charSet = match.Groups[0].Value.Replace("=?", "").Replace("?B?", "").Replace("?b?", "");
}
Do you know what I'm missing?
Upvotes: 4
Views: 30122
Reputation: 71538
A non-regex way:
string msg= "=?windows-1258?B?UkU6IFRyIDogUGxhbiBkZSBjb250aW51aXTpIGQnYWN0aXZpdOkgZGVz?= =?windows-1258?B?IHNlcnZldXJzIFdlYiBHb1ZveWFnZXN=?=";
string[] charSetOccurences = msg.Split(new string[]{ " " }, StringSplitOptions.None);
foreach (string s in charSetOccurences)
{
string charSet = s.Replace("=?", "").Replace("?B?", "").Replace("?b?", "");
Console.WriteLine(charSet);
}
See an ideone.
And if you still want to use regex, you should make the .* lazy by adding a ? after it (.*?). This was already mentioned in the other answers, but it seems you are not getting matches?
string msg= "=?windows-1258?B?UkU6IFRyIDogUGxhbiBkZSBjb250aW51aXTpIGQnYWN0aXZpdOkgZGVz?= =?windows-1258?B?IHNlcnZldXJzIFdlYiBHb1ZveWFnZXN=?=";
var charSetOccurences = new Regex(@"=\?.*?\?B\?.*?\?=", RegexOptions.IgnoreCase);
var charSetMatches = charSetOccurences.Matches(msg);
foreach (Match match in charSetMatches)
{
string charSet = match.Groups[0].Value.Replace("=?", "").Replace("?B?", "").Replace("?b?", "");
Console.WriteLine(charSet);
}
See another ideone.
The output is the same in both cases:
windows-1258UkU6IFRyIDogUGxhbiBkZSBjb250aW51aXTpIGQnYWN0aXZpdOkgZGVz?=
windows-1258IHNlcnZldXJzIFdlYiBHb1ZveWFnZXN=
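Note that the two output lines end differently: the first keeps its trailing ?= while the second does not. This happens because Replace("=?", "") removes every =? in the string, and in the second word the base64 padding = followed by the closing ?= forms another =? pair. A minimal sketch of stripping the delimiters by position instead (StripDelimiters is a hypothetical helper name, not part of the original code):

```csharp
using System;

public static class EncodedWordTrim
{
    // Hypothetical helper: removes the leading "=?" and the trailing "?="
    // by position, so a "=?" that appears inside the word is left alone.
    public static string StripDelimiters(string word)
    {
        if (word.Length >= 4
            && word.StartsWith("=?", StringComparison.Ordinal)
            && word.EndsWith("?=", StringComparison.Ordinal))
        {
            return word.Substring(2, word.Length - 4);
        }
        return word; // not delimited as expected: return unchanged
    }

    public static void Main()
    {
        string second = "=?windows-1258?B?IHNlcnZldXJzIFdlYiBHb1ZveWFnZXN=?=";
        // prints: windows-1258?B?IHNlcnZldXJzIFdlYiBHb1ZveWFnZXN=
        Console.WriteLine(StripDelimiters(second));
    }
}
```

The charset, encoding letter and payload are still joined by ? at this point, so they can be separated afterwards with Split or with capture groups.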
EDIT: As per your update, here is an all-in-one solution for your problem:
string msg= "=?windows-1258?B?UkU6IFRyIDogUGxhbiBkZSBjb250aW51aXTpIGQnYWN0aXZpdOkgZGVz?= =?windows-1258?B?IHNlcnZldXJzIFdlYiBHb1ZveWFnZXN=?=";
var charSetOccurences = new Regex(@"=\?.*?\?[BQ]\?.*?\?=", RegexOptions.IgnoreCase);
MatchCollection matches = charSetOccurences.Matches(msg);
foreach (Match match in matches)
{
string[] encoding = match.Groups[0].Value.Split(new string[]{ "?" }, StringSplitOptions.None);
string charSet = encoding[1];
string encodeType = encoding[2];
string encodedString = encoding[3];
Console.WriteLine("Charset: " + charSet);
Console.WriteLine("Encoding type: " + encodeType);
Console.WriteLine("Encoded String: " + encodedString + "\n");
}
Returns:
Charset: windows-1258
Encoding type: B
Encoded String: UkU6IFRyIDogUGxhbiBkZSBjb250aW51aXTpIGQnYWN0aXZpdOkgZGVz
Charset: windows-1258
Encoding type: B
Encoded String: IHNlcnZldXJzIFdlYiBHb1ZveWFnZXN=
See this.
Or since we already had the regex, we can use:
string msg= "=?windows-1258?B?UkU6IFRyIDogUGxhbiBkZSBjb250aW51aXTpIGQnYWN0aXZpdOkgZGVz?= =?windows-1258?B?IHNlcnZldXJzIFdlYiBHb1ZveWFnZXN=?=";
var charSetOccurences = new Regex(@"=\?(.*?)\?([BQ])\?(.*?)\?=", RegexOptions.IgnoreCase);
MatchCollection matches = charSetOccurences.Matches(msg);
foreach (Match match in matches)
{
Console.WriteLine("Charset: " + match.Groups[1].Value);
Console.WriteLine("Encoding type: " + match.Groups[2].Value);
Console.WriteLine("Encoded String: " + match.Groups[3].Value + "\n");
}
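Since the goal is to decode the captured parts later, here is a rough sketch of that step for the B (base64) encoding type only. DecodeBase64Word is a hypothetical helper name. The sketch assumes the charset name is known to the runtime: on .NET Core and later, windows-1258 is typically only available after registering the code pages provider, so the example below uses iso-8859-1 as a stand-in, which is an assumption that happens to suit this sample rather than a general rule.

```csharp
using System;
using System.Text;

public static class EncodedWordDecode
{
    // Hypothetical helper: decodes the base64 payload of a "B" encoded word
    // using the given charset name. Throws if the charset is unknown to the
    // runtime or if the payload is not valid base64.
    public static string DecodeBase64Word(string charsetName, string base64Text)
    {
        byte[] bytes = Convert.FromBase64String(base64Text);
        return Encoding.GetEncoding(charsetName).GetString(bytes);
    }

    public static void Main()
    {
        // Payload of the first encoded word from the question.
        string payload = "UkU6IFRyIDogUGxhbiBkZSBjb250aW51aXTpIGQnYWN0aXZpdOkgZGVz";
        // Assumption: iso-8859-1 used in place of windows-1258 (see note above).
        Console.WriteLine(DecodeBase64Word("iso-8859-1", payload));
    }
}
```

The Q encoding type (quoted-printable style) would need a separate decoding routine, which is not shown here.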
Upvotes: 2
Reputation:
When the regexp parser sees the .* character sequence, it matches everything up to the end of the string and then goes back, char by char (greedy match). So, to avoid the problem, you can use a non-greedy match or explicitly define the characters that can appear in the string.
"=\?[a-zA-Z0-9?=-]*\?B\?[a-zA-Z0-9?=-]*\?="
Upvotes: 3
Reputation: 91452
.* is greedy and will match everything from the first ? to the last ?B?.
You need to use either a non-greedy match
=\?.*?\?B\?.*?\?=
or exclude ? from the set of characters being matched
=\?[^?]*\?B\?[^?]*\?=
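To see the difference, the three patterns discussed here can be run against the string from the question. This is a small sketch; PatternComparison and CountMatches are hypothetical names introduced only for the illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PatternComparison
{
    // Hypothetical helper: counts how many matches a pattern finds in the input.
    public static int CountMatches(string pattern, string input)
    {
        return Regex.Matches(input, pattern, RegexOptions.IgnoreCase).Count;
    }

    public static void Main()
    {
        string msg = "=?windows-1258?B?UkU6IFRyIDogUGxhbiBkZSBjb250aW51aXTpIGQnYWN0aXZpdOkgZGVz?= =?windows-1258?B?IHNlcnZldXJzIFdlYiBHb1ZveWFnZXN=?=";

        // Greedy: .* runs to the end and backtracks, so the whole string is one match.
        Console.WriteLine(CountMatches(@"=\?.*\?B\?.*\?=", msg));         // prints 1
        // Lazy: .*? stops at the first position where the rest can match.
        Console.WriteLine(CountMatches(@"=\?.*?\?B\?.*?\?=", msg));       // prints 2
        // Negated class: [^?]* cannot cross a ?, so each word matches on its own.
        Console.WriteLine(CountMatches(@"=\?[^?]*\?B\?[^?]*\?=", msg));   // prints 2
    }
}
```

The negated class tends to be the stricter choice, because it can never run across a ? delimiter, whereas the lazy version relies on the first possible end position being the right one.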
Upvotes: 1