Fredy Treboux

Reputation: 3331

Find substring ignoring specified characters

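Does anyone know of an easy, clean way to find a substring within a string while ignoring some specified characters? I think an example would explain things better: given the string "Hello, -this- is a string", the substring "Hello this", and the ignored characters ',' and '-', the search should find "Hello, -this".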

Using Regex is not a requirement for me, but I added the tag because it feels related.

Update:

To make the requirement clearer: I need the resulting substring with the ignored chars, not just an indication that the given substring exists.

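Update 2: Some of you are reading too much into the example, sorry. I'll give another scenario that should work: given the string "?A&3/3/C)412&", the substring "A41", and the ignored characters '&', '/', '3', 'C' and ')', the search should find "A&3/3/C)41".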

And as a bonus (not required per se), it would be great if the solution did not have to assume that the substring to find is free of the ignored chars. For example, given the last scenario, we should be able to search for "A3C412&" and find "A&3/3/C)412&".

Sorry if I wasn't clear before, or if I'm still not :).

Update 3:

Thanks to everyone who helped! This is the implementation I'm working with for now:

And here are some tests:

I'm using some custom extension methods that I'm not including, but I believe they are self-explanatory (I will add them if you like). I've taken a lot of your ideas for the implementation and the tests, but I'm accepting @PierrOz's answer because he was one of the first to reply and pointed me in the right direction. Feel free to keep suggesting alternative solutions or commenting on the current state of the implementation.

Upvotes: 8

Views: 6025

Answers (9)

Franck Dernoncourt

Reputation: 83147

Posting Fredy Treboux's solution and tests here, in case the pastebin disappears:

Solution:

public static string SubstringSearch(this string text, string value)
{
    return text.SubstringSearch(value, null);
}

public static string SubstringSearch(this string text, string value, char[] charsToIgnore)
{
    if (text.IsNullOrEmpty() || value.IsNullOrEmpty()) return string.Empty;

    //No ignored chars
    if (charsToIgnore == null || charsToIgnore.Length == 0)
    {
        if (!text.Contains(value)) return string.Empty;
        var substringIndex = text.IndexOf(value);
        return text.Substring(substringIndex, value.Length);
    }
    else
    {
        //Use regex when there are ignored chars
        var regex = BuildSubstringSearchRegex(value, charsToIgnore);
        var match = regex.Match(text);
        return match.Success ? match.Value : string.Empty;
    }
}

private static Regex BuildSubstringSearchRegex(string value, char[] charsToIgnore)
{
    const string ignorePattern = "[{0}]*?";

    var ignoreString = string.Format(ignorePattern, "".Join(charsToIgnore.Select(x => SanitizeCharForRegex(x))));
    
    var regexString = new StringBuilder();
    foreach (var character in value)
    {
        regexString.Append(SanitizeCharForRegex(character) + ignoreString);
    }
    
    return new Regex(regexString.ToString(), RegexOptions.IgnoreCase);
}

private static string SanitizeCharForRegex(char character)
{
    //escape the dash
    if (character == '-') return @"\-";
    return Regex.Escape(character.ToString());
}
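
The solution above calls two extension methods that are not shown: IsNullOrEmpty() and "".Join(...). The following is only a guess at what they might look like, inferred from how they are called; the class name StringHelperExtensions is made up for illustration.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helpers; the original post does not include them.
public static class StringHelperExtensions
{
    // Instance-style wrapper around the static string.IsNullOrEmpty.
    public static bool IsNullOrEmpty(this string text)
    {
        return string.IsNullOrEmpty(text);
    }

    // Lets a separator be written first: "".Join(parts) concatenates
    // the parts with an empty separator.
    public static string Join(this string separator, IEnumerable<string> values)
    {
        return string.Join(separator, values);
    }
}
```

With these in scope, "".Join(new[] { "a", @"\-", "z" }) yields the string a\-z, which is the shape BuildSubstringSearchRegex needs for the inside of its character class.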

Tests:


/// <summary>
/// Should return empty if no substring found and no ignored chars.
/// </summary>
[TestMethod]
public void ShouldReturnEmptyIfNoSubstringFoundAndNoIgnoredChars()
{
    const string sut = "SomeString";
    const string substring = "DoesNotExist";

    Assert.AreEqual(string.Empty, sut.SubstringSearch(substring));
}

/// <summary>
/// Should return substring found if found and no ignored chars.
/// </summary>
[TestMethod]
public void ShouldReturnSubstringFoundIfFoundAndNoIgnoredChars()
{
    const string sut = "SomeString";
    const string substring = "Str";

    Assert.AreEqual(substring, sut.SubstringSearch(substring));
}

/// <summary>
/// Should return substring found if found while ignoring chars but ignored chars are not present.
/// </summary>
[TestMethod]
public void ShouldReturnSubstringFoundIfFoundWhileIgnoringCharsButIgnoredCharsAreNorPresent()
{
    const string sut = "SomeString";
    const string substring = "Str";
    var ignoredChars = new[] { '/', '*', '(' };

    Assert.AreEqual(substring, sut.SubstringSearch(substring, ignoredChars));
}

/// <summary>
/// Should return substring found if found while ignoring chars.
/// </summary>
[TestMethod]
public void ShouldReturnSubstringFoundIfFoundWhileIgnoringChars()
{
    const string sut = "Some(S/t*/(*ring";
    const string substring = "Str";
    const string substringToFind = "S/t*/(*r";
    var ignoredChars = new[] { '/', '*', '(' };

    Assert.AreEqual(substringToFind, sut.SubstringSearch(substring, ignoredChars));
}

/// <summary>
/// Should return substring found without trailing ignored chars if found while ignoring chars.
/// </summary>
[TestMethod]
public void ShouldReturnSubstringFoundWithoutTrailingIgnoredCharsIfFoundWhileIgnoringChars()
{
    //This is to make sure implementation is not returning "S/t*/(*r/" in this case.
    const string sut = "Some(S/t*/(*r/ing";
    const string substring = "Str";
    const string substringToFind = "S/t*/(*r";
    var ignoredChars = new[] { '/', '*', '(' };

    Assert.AreEqual(substringToFind, sut.SubstringSearch(substring, ignoredChars));
}

/// <summary>
/// Should return substring found if found while ignoring chars even if substring to find has ignored chars on it.
/// </summary>
[TestMethod]
public void ShouldReturnSubstringFoundIfFoundWhileIgnoringCharsEvenIfSubstringToFindHasIgnoredCharsOnIt()
{
    const string sut = "Some(S/t*/(*ring";
    const string substring = "S/t/*r";
    const string substringToFind = "S/t*/(*r";
    var ignoredChars = new[] { '/', '*', '(' };

    Assert.AreEqual(substringToFind, sut.SubstringSearch(substring, ignoredChars));
}

/// <summary>
/// Should not take the dash as a range if searching substring and dash is ignored.
/// </summary>
[TestMethod]
public void ShouldNotTakeTheDashAsARangeIfSearchingSubstringAndDashIsIgnored()
{
    const string sut = "SomeStbring";
    const string substring = "Str";
    var ignoredChars = new[] { 'a', '-', 'z' };

    Assert.AreEqual(string.Empty, sut.SubstringSearch(substring, ignoredChars));
}


/// <summary>
/// Should pass stack overflow question use cases.
/// </summary>
/// <remarks>http://stackoverflow.com/questions/2592613/find-substring-ignoring-specified-characters</remarks>
[TestMethod]
public void ShouldPassStackOverflowQuestionUseCases()
{
    Assert.AreEqual("Hello, -this", "Hello, -this- is a string".SubstringSearch("Hello this", new[] { ',', '-' }));
    Assert.AreEqual("A&3/3/C)41", "?A&3/3/C)412&".SubstringSearch("A41", new[] { '&', '/', '3', 'C', ')' }));
    Assert.AreEqual("A&3/3/C)412&", "?A&3/3/C)412&".SubstringSearch("A3C412&", new[] { '&', '/', '3', 'C', ')' }));
}

I converted it to Python; see "How to find all occurrences of a substring in a string while ignore some characters in Python?"

Upvotes: 1

pierroz

Reputation: 7870

In your example you would do:

string input = "Hello, -this-, is a string";
string ignore = "[-,]*";
Regex r = new Regex(string.Format("H{0}e{0}l{0}l{0}o{0} {0}t{0}h{0}i{0}s{0}", ignore));
Match m = r.Match(input);
return m.Success ? m.Value : string.Empty;

To do this dynamically, build the character class (here [-,]) from all the characters to ignore and insert it after each character of your query.

Take care with '-' inside the class []: put it at the beginning or at the end, otherwise it is read as a range operator.

So more generically, it would give something like:

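public string Test(string query, string input, char[] ignoreList)
{
    // Build the character class of ignored characters.
    string ignorePattern = "[";
    for (int i = 0; i < ignoreList.Length; i++)
    {
        if (ignoreList[i] == '-')
        {
            // Put the dash first so it is not read as a range.
            // Strings are immutable, so the result of Insert must be assigned.
            ignorePattern = ignorePattern.Insert(1, "-");
        }
        else
        {
            // Escape characters that are special inside a character class.
            ignorePattern += ignoreList[i] == ']' ? @"\]" : Regex.Escape(ignoreList[i].ToString());
        }
    }

    ignorePattern += "]*";

    // Insert the ignore pattern after every character of the query.
    string pattern = "";
    for (int i = 0; i < query.Length; i++)
    {
        pattern += Regex.Escape(query[i].ToString()) + ignorePattern;
    }

    Regex r = new Regex(pattern);
    Match m = r.Match(input);
    return m.Success ? m.Value : string.Empty;
}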

Upvotes: 1

Jamie Altizer

Reputation: 1542

You could always use a combination of RegEx and string searching:

public class RegExpression {

  public static void Example(string input, string ignore, string find)
  {
     string output = string.Format("Input: {1}{0}Ignore: {2}{0}Find: {3}{0}{0}", Environment.NewLine, input, ignore, find);
     if (SanitizeText(input, ignore).ToString().Contains(SanitizeText(find, ignore)))
        Console.WriteLine(output + "was matched");
     else
        Console.WriteLine(output + "was NOT matched");
     Console.WriteLine();
  }

  public static string SanitizeText(string input, string ignore)
  {
     Regex reg = new Regex("[^" + ignore + "]");
     StringBuilder newInput = new StringBuilder();
     foreach (Match m in reg.Matches(input))
     {
        newInput.Append(m.Value);
     }
     return newInput.ToString();
  }

}

Usage would look like this:

RegExpression.Example("Hello, -this- is a string", "-,", "Hello this");  //Should match
RegExpression.Example("Hello, -this- is a string", "-,", "Hello this2"); //Should not match
RegExpression.Example("?A&3/3/C)412&", "&/3C\\)", "A41"); // Should match
RegExpression.Example("?A&3/3/C) 412&", "&/3C\\)", "A41"); // Should not match
RegExpression.Example("?A&3/3/C)412&", "&/3C\\)", "A3C412&"); // Should match

Output:

Input: Hello, -this- is a string
Ignore: -,
Find: Hello this

was matched

Input: Hello, -this- is a string
Ignore: -,
Find: Hello this2

was NOT matched

Input: ?A&3/3/C)412&
Ignore: &/3C)
Find: A41

was matched

Input: ?A&3/3/C) 412&
Ignore: &/3C)
Find: A41

was NOT matched

Input: ?A&3/3/C)412&
Ignore: &/3C)
Find: A3C412&

was matched

Upvotes: 0

Ahmad Mageed

Reputation: 96477

EDIT: Here's an updated solution addressing the points in your recent update. The idea is the same, except that for a single substring the ignore pattern is inserted between each character. If the substring contains spaces, it is split on the spaces and the ignore pattern is inserted between the words instead. If you don't need the latter behavior (which was more in line with your original question), you can remove the Split call and the if check that builds that pattern.

Note that this approach is not going to be the most efficient.

string input = @"foo ?A&3/3/C)412& bar A341C2";
string substring = "A41";
string[] ignoredChars = { "&", "/", "3", "C", ")" };

// builds up the ignored pattern and ensures a dash char is placed at the end to avoid unintended ranges
string ignoredPattern = String.Concat("[",
                            String.Join("", ignoredChars.Where(c => c != "-")
                                                        .Select(c => Regex.Escape(c)).ToArray()),
                            (ignoredChars.Contains("-") ? "-" : ""),
                            "]*?");

string[] substrings = substring.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);

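string pattern = "";
if (substrings.Length > 1)
{
    // escape each word so that regex metacharacters are matched literally
    pattern = String.Join(ignoredPattern, substrings.Select(s => Regex.Escape(s)).ToArray());
}
else
{
    // escape each character for the same reason
    pattern = String.Join(ignoredPattern, substring.Select(c => Regex.Escape(c.ToString())).ToArray());
}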

foreach (Match match in Regex.Matches(input, pattern))
{
    Console.WriteLine("Index: {0} -- Match: {1}", match.Index, match.Value);
}


Try this solution out:

string input = "Hello, -this- is a string";
string[] searchStrings = { "Hello", "this" };
string pattern = String.Join(@"\W+", searchStrings);

foreach (Match match in Regex.Matches(input, pattern))
{
    Console.WriteLine(match.Value);
}

The \W+ will match one or more non-word characters, that is, anything other than letters, digits, and the underscore. If you would rather specify the characters yourself, you can replace it with a character class of the characters to ignore, such as [ ,.-]+ (always place the dash character at the start or end to avoid unintended range specifications). Also, if you need case to be ignored, use RegexOptions.IgnoreCase:

Regex.Matches(input, pattern, RegexOptions.IgnoreCase)
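
As a minimal sketch of the character-class variant combined with RegexOptions.IgnoreCase: the class WordSearch and the helper FindWords are hypothetical names, not part of the answer, and the separator class is passed in as a ready-made regex fragment such as "[ ,.-]".

```csharp
using System;
using System.Text.RegularExpressions;

public static class WordSearch
{
    // Hypothetical helper: joins the search words with a character class of
    // separators to skip, then returns the first match ("" if there is none).
    public static string FindWords(string input, string[] searchStrings, string separatorClass)
    {
        // Escape each word so that regex metacharacters are matched literally.
        string[] escaped = Array.ConvertAll(searchStrings, s => Regex.Escape(s));

        // One or more separator characters are required between words.
        string pattern = string.Join(separatorClass + "+", escaped);

        Match match = Regex.Match(input, pattern, RegexOptions.IgnoreCase);
        return match.Success ? match.Value : string.Empty;
    }
}
```

For example, WordSearch.FindWords("Hello, -this- is a string", new[] { "hello", "THIS" }, "[ ,.-]") returns "Hello, -this", since the match value is taken from the input and keeps its original casing and separators.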

If your substring is in the form of a complete string, such as "Hello this", you can easily get it into the array form used for searchStrings in this way:

string[] searchStrings = substring.Split(new[] { ' ' },
                            StringSplitOptions.RemoveEmptyEntries);

Upvotes: 1

msarchet

Reputation: 15242

You could do something like this, since almost all of these answers require rebuilding the string in some form.

myString is the string you want to look through.

//Create a List<string> that contains the ignored characters
List<string> ignoredCharacters = new List<string>();

//Add all of the characters you wish to ignore, using whichever method you choose

//Use a function here to get a return value

public bool subStringExist(List<string> ignoredCharacters, string myString, string toMatch)
{
    //Copy Your string to a temp

    string tempString = myString;
    bool match = false;

    //Replace Everything that you don't want

    foreach (string item in ignoredCharacters)
    {
        tempString = tempString.Replace(item, "");
    }

    //Check if your substring exist
    if (tempString.Contains(toMatch))
    {
        match = true;
    }
    return match;
}
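
Stripping the ignored characters, as above, only answers whether a match exists, while the question also asks for the matched text including the ignored characters. One possible extension, sketched here with hypothetical names (StrippedSearch, FindIgnoring), records the original index of every kept character so that the match can be mapped back to the source string. It assumes that ignored characters in the search value are dropped as well, so a trailing ignored character in the value is not included in the result.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

public static class StrippedSearch
{
    // Returns the span of the original text that matches value once ignored
    // characters are skipped, or "" if there is no match.
    public static string FindIgnoring(string text, string value, char[] charsToIgnore)
    {
        var ignored = new HashSet<char>(charsToIgnore ?? new char[0]);

        // Build the stripped text and remember where each kept char came from.
        var stripped = new StringBuilder();
        var originalIndex = new List<int>();
        for (int i = 0; i < text.Length; i++)
        {
            if (ignored.Contains(text[i])) continue;
            stripped.Append(text[i]);
            originalIndex.Add(i);
        }

        // Ignored characters are removed from the search value as well.
        string needle = new string(value.Where(c => !ignored.Contains(c)).ToArray());
        if (needle.Length == 0) return string.Empty;

        int pos = stripped.ToString().IndexOf(needle, StringComparison.Ordinal);
        if (pos < 0) return string.Empty;

        // Map the first and last matched characters back to the original text.
        int start = originalIndex[pos];
        int end = originalIndex[pos + needle.Length - 1];
        return text.Substring(start, end - start + 1);
    }
}
```

Called as StrippedSearch.FindIgnoring("Hello, -this- is a string", "Hello this", new[] { ',', '-' }), this returns "Hello, -this"; the comparison is ordinal and case-sensitive.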

Upvotes: 0

DShultz

Reputation: 4541

Here's a non-regex way to do it using string parsing.

    private string GetSubstring()
    {
        string searchString = "Hello, -this- is a string";
        string searchStringWithoutUnwantedChars = searchString.Replace(",", "").Replace("-", "");

        string desiredString = string.Empty;
        if (searchStringWithoutUnwantedChars.Contains("Hello this"))
        {
            // Substring takes (startIndex, length), so compute the length
            // from the start of "Hello" to the end of "this".
            int start = searchString.IndexOf("Hello");
            int end = searchString.IndexOf("this", start) + "this".Length;
            desiredString = searchString.Substring(start, end - start);
        }

        return desiredString;
    }

Upvotes: 0

300 baud

Reputation: 1672

Here's a non-regex string extension option:

public static class StringExtensions
{
    public static bool SubstringSearch(this string s, string value, char[] ignoreChars, out string result)
    {
        if (String.IsNullOrEmpty(value))
            throw new ArgumentException("Search value cannot be null or empty.", "value");

        bool found = false;
        int matches = 0;
        int startIndex = -1;
        int length = 0;

        for (int i = 0; i < s.Length && !found; i++)
        {
            if (startIndex == -1)
            {
                if (s[i] == value[0])
                {
                    startIndex = i;
                    ++matches;
                    ++length;
                }
            }
            else
            {
                if (s[i] == value[matches])
                {
                    ++matches;
                    ++length;
                }
                else if (ignoreChars != null && ignoreChars.Contains(s[i]))
                {
                    ++length;
                }
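                else
                {
                    // Mismatch: resume scanning right after the failed start
                    // position so an overlapping candidate is not skipped.
                    i = startIndex;
                    startIndex = -1;
                    matches = 0;
                    length = 0;
                }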
            }

            found = (matches == value.Length);
        }

        if (found)
        {
            result = s.Substring(startIndex, length);
        }
        else
        {
            result = null;
        }
        return found;
    }
}

Upvotes: 1

Martin Smith

Reputation: 453028

You could do this with a single Regex but it would be quite tedious as after every character you would need to test for zero or more ignored characters. It is probably easier to strip all the ignored characters with Regex.Replace(subject, "[-,]", ""); then test if the substring is there.

Or the single Regex way:

Regex.IsMatch(subject, "H[-,]*e[-,]*l[-,]*l[-,]*o[-,]* [-,]*t[-,]*h[-,]*i[-,]*s[-,]*")
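
The first suggestion (strip the ignored characters, then test) could be sketched as follows. StripThenTest, Strip and ContainsIgnoring are hypothetical names, and the pattern is built as an alternation of escaped characters rather than the [-,] class shown above, so that '-' and ']' need no special handling. Like Regex.IsMatch, this only reports whether a match exists.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class StripThenTest
{
    // Removes every ignored character from the input.
    public static string Strip(string input, char[] charsToIgnore)
    {
        if (charsToIgnore == null || charsToIgnore.Length == 0) return input;

        // Regex.Escape is safe here because the characters are used outside
        // a character class, e.g. ',' and '-' become the pattern ",|-".
        string pattern = string.Join("|", charsToIgnore.Select(c => Regex.Escape(c.ToString())));
        return Regex.Replace(input, pattern, "");
    }

    // True if value occurs in subject once ignored characters are removed
    // from both strings.
    public static bool ContainsIgnoring(string subject, string value, char[] charsToIgnore)
    {
        return Strip(subject, charsToIgnore).Contains(Strip(value, charsToIgnore));
    }
}
```

For instance, StripThenTest.ContainsIgnoring("Hello, -this- is a string", "Hello this", new[] { ',', '-' }) is true, while searching for "Hello this2" with the same arguments is false.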

Upvotes: 0

Jaxidian

Reputation: 13511

This code will do what you want, although I suggest you modify it to fit your needs better:

string resultString = null;

try
{
    resultString = Regex.Match(subjectString, "Hello[, -]*this", RegexOptions.IgnoreCase).Value;
}
catch (ArgumentException ex)
{
    // Syntax error in the regular expression
}

Upvotes: 0
