Reputation: 3957
I need to do the following:
static string[] pats = { "å", "Å", "æ", "Æ", "ä", "Ä", "ö", "Ö", "ø", "Ø" ,"è", "È", "à", "À", "ì", "Ì", "õ", "Õ", "ï", "Ï" };
static string[] repl = { "a", "A", "a", "A", "a", "A", "o", "O", "o", "O", "e", "E", "a", "A", "i", "I", "o", "O", "i", "I" };
static int i = pats.Length;
int j;
// function for the replacement(s)
public string DoRepl(string Inp) {
string tmp = Inp;
for( j = 0; j < i; j++ ) {
tmp = Regex.Replace(tmp,pats[j],repl[j]);
}
return tmp.ToString();
}
/* Main flow processes about 45000 lines of input */
Each line has 6 elements that go through DoRepl. Approximately 300,000 function calls. Each does 20 Regex.Replace, totalling ~6 million replaces.
Is there any more elegant way to do this in fewer passes?
Upvotes: 23
Views: 6754
Reputation: 64628
Without regex it might be way faster.
for( j = 0; j < i; j++ )
{
tmp = tmp.Replace(pats[j], repl[j]);
}
Edit: another way, using Zip and a StringBuilder:
StringBuilder result = new StringBuilder(input);
foreach (var zipped in patterns.Zip(replacements, (p, r) => new { p, r }))
{
result = result.Replace(zipped.p, zipped.r);
}
return result.ToString();
Upvotes: 10
Reputation: 9965
The fastest (IMHO) way (compared even with the dictionary) in the special case of one-to-one character replacement would be a full character map:
public class Converter
{
private readonly char[] _map;
public Converter()
{
// char is a 16-bit unsigned code unit, so the map needs
// char.MaxValue + 1 entries to cover indices 0..65535
_map = new char[char.MaxValue + 1];
for (int i = 0; i < _map.Length; i++)
_map[i] = (char)i;
_map['å'] = 'a'; // Note that 'å' is used as an integer index into the array.
_map['Å'] = 'A';
_map['æ'] = 'a';
// ... the rest of overriding map
}
public string Convert(string source)
{
if (string.IsNullOrEmpty(source))
return source;
var result = new char[source.Length];
for (int i = 0; i < source.Length; i++)
result[i] = _map[source[i]]; // convert using the map
return new string(result);
}
}
To speed this code up further, you might want to use the "unsafe" keyword and pointers. That way, traversing the string and the result array can be done without bounds checks (which in theory the JIT compiler would optimize away, but might not).
Upvotes: 1
Reputation: 262929
First, I would use a StringBuilder to perform the translation inside a buffer and avoid creating new strings all over the place.
Next, ideally we'd like something akin to XPath's translate(), so we can work with strings instead of arrays or mappings. Let's do that in an extension method:
public static StringBuilder Translate(this StringBuilder builder,
string inChars, string outChars)
{
int length = Math.Min(inChars.Length, outChars.Length);
for (int i = 0; i < length; ++i) {
builder.Replace(inChars[i], outChars[i]);
}
return builder;
}
Then use it:
StringBuilder builder = new StringBuilder(yourString);
yourString = builder.Translate("åÅæÆäÄöÖøØèÈàÀìÌõÕïÏ",
"aAaAaAoOoOeEaAiIoOiI").ToString();
Upvotes: 3
Reputation: 96477
The problem with your original regex is that you're not using it to its fullest potential. Remember, a regex pattern can have alternations. You will still need a dictionary, but you can do it in one pass without looping through each character.
This would be achieved as follows:
string[] pats = { "å", "Å", "æ", "Æ", "ä", "Ä", "ö", "Ö", "ø", "Ø" ,"è", "È", "à", "À", "ì", "Ì", "õ", "Õ", "ï", "Ï" };
string[] repl = { "a", "A", "a", "A", "a", "A", "o", "O", "o", "O", "e", "E", "a", "A", "i", "I", "o", "O", "i", "I" };
// using Zip as a shortcut, otherwise setup dictionary differently as others have shown
var dict = pats.Zip(repl, (k,v) => new { Key = k, Value = v }).ToDictionary(o => o.Key, o => o.Value);
string input = "åÅæÆäÄöÖøØèÈàÀìÌõÕïÏ";
string pattern = String.Join("|", dict.Keys.Select(k => k)); // use ToArray() for .NET 3.5
string result = Regex.Replace(input, pattern, m => dict[m.Value]);
Console.WriteLine("Pattern: " + pattern);
Console.WriteLine("Input: " + input);
Console.WriteLine("Result: " + result);
Of course, you should always escape your pattern using Regex.Escape. In this case this is not needed since we know the finite set of characters and they don't need to be escaped.
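When the same mapping is applied to many lines, the pattern can be built once and reused rather than rebuilt on every call. Below is a minimal sketch of that idea; the class name RegexTransliterator and its members are hypothetical, and it assumes .NET 4 or later for the string.Join overload that takes a sequence. It also escapes each key with Regex.Escape, so the same code stays safe if the key set later includes regex metacharacters.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper: builds the alternation pattern once and reuses it.
public static class RegexTransliterator
{
    // Same pairs as the pats/repl arrays above, stored as a lookup table.
    private static readonly Dictionary<string, string> Map = new Dictionary<string, string>
    {
        { "å", "a" }, { "Å", "A" }, { "æ", "a" }, { "Æ", "A" },
        { "ä", "a" }, { "Ä", "A" }, { "ö", "o" }, { "Ö", "O" },
        { "ø", "o" }, { "Ø", "O" }, { "è", "e" }, { "È", "E" },
        { "à", "a" }, { "À", "A" }, { "ì", "i" }, { "Ì", "I" },
        { "õ", "o" }, { "Õ", "O" }, { "ï", "i" }, { "Ï", "I" }
    };

    // Escape every key, join with "|", and compile the regex a single time.
    private static readonly Regex Pattern = new Regex(
        string.Join("|", Map.Keys.Select(Regex.Escape)),
        RegexOptions.Compiled);

    // One pass over the input; each match is looked up in the dictionary.
    public static string Replace(string input)
    {
        return Pattern.Replace(input, m => Map[m.Value]);
    }
}
```

Calling RegexTransliterator.Replace(line) for each input element then costs one regex pass per call instead of twenty. Whether RegexOptions.Compiled pays off depends on the workload, so it is worth measuring both ways.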
Upvotes: 2
Reputation: 31428
How about this "trick"?
string conv = Encoding.ASCII.GetString(Encoding.GetEncoding("Cyrillic").GetBytes(input));
Upvotes: 12
Reputation: 918
If you want to remove accents, then perhaps this solution would be helpful: "How do I remove diacritics (accents) from a string in .NET?"
Otherwise I would do this in a single pass:
Dictionary<char, char> replacements = new Dictionary<char, char>();
...
StringBuilder result = new StringBuilder();
foreach(char c in str)
{
char rc;
if (!replacements.TryGetValue(c, out rc))
{
rc = c;
}
result.Append(rc);
}
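For the accent-removal route mentioned at the start of this answer, a common approach is to decompose the string with Unicode normalization and drop the combining marks. The sketch below is an illustration of that technique, not code taken from the linked question; the class name Diacritics is hypothetical. Note that æ, Æ, ø and Ø have no canonical decomposition, so they pass through unchanged and would still need an explicit mapping such as the dictionary above.

```csharp
using System.Globalization;
using System.Text;

// Hypothetical helper: strips combining accents via Unicode normalization.
public static class Diacritics
{
    public static string Remove(string input)
    {
        if (string.IsNullOrEmpty(input))
            return input;

        // FormD splits a character such as 'å' into 'a' plus a combining ring.
        string decomposed = input.Normalize(NormalizationForm.FormD);
        var sb = new StringBuilder(decomposed.Length);

        foreach (char c in decomposed)
        {
            // Keep everything except the combining (non-spacing) marks.
            if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
                sb.Append(c);
        }

        // Recompose whatever is left into the usual precomposed form.
        return sb.ToString().Normalize(NormalizationForm.FormC);
    }
}
```

This handles accented letters that are not in the original list as well, which may or may not be what you want for your data.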
Upvotes: 1
Reputation: 1755
I'm not familiar with the Regex class, but most regular expression engines have a transliterate operation that would work well here. Then you would only need one call per line.
Upvotes: 0
Reputation: 6723
static Dictionary<char, char> repl = new Dictionary<char, char>() { { 'å', 'a' }, { 'ø', 'o' } }; // etc...
public string DoRepl(string Inp)
{
var tmp = Inp.Select(c =>
{
char r;
if (repl.TryGetValue(c, out r))
return r;
return c;
});
return new string(tmp.ToArray());
}
Each char is checked only once against a dictionary and replaced if found in the dictionary.
Upvotes: 21