priyanka.sarkar

Reputation: 26508

Replace characters in C#

I have text that can contain any characters, and I need to filter it as follows:

a) Retain only alphanumeric characters (plus the spaces between words).

b) Remove the word "The" when it stands as a separate word, that is, when it is preceded or followed by a space.

e.g.

CASE 1:

 Input:  The Company Pvt Ltd. 

 Output: Company Pvt Ltd

But 

     Input:  TheCompany Pvt Ltd. 

     Output: TheCompany Pvt Ltd

because there is no space between the words "The" and "Company".

CASE 2:

Similarly, Input:  Company Pvt Ltd.  The 

     Output: Company Pvt Ltd

But Input:  Company Pvt Ltd.The 

     Output: Company Pvt Ltd

Case 3:

Input: Company@234 Pvt; Ltd.

Output: Company234 Pvt Ltd

The output must not contain commas, periods, or any other special characters.

I am currently assigning the data to a variable like this:

 _company.ShortName = _company.CompanyName.ToUpper();

I cannot change anything at save time; the filter has to be applied only when I read the data back from the database. The data arrives in _company.CompanyName, and that is the value I have to filter.

This is what I have so far:

public string ReplaceCharacters(string words)
{
    words = words.Replace(",", " ");
    words = words.Replace(";", " ");
    words = words.Replace(".", " ");
    words = words.Replace("THE ", " ");
    words = words.Replace(" THE", " ");
    return words;
}

private void button1_Click(object sender, EventArgs e)
{
    MessageBox.Show(ReplaceCharacters(textBox1.Text.ToUpper()));
}

Thanks in advance. I am using C#

Upvotes: 2

Views: 736

Answers (2)

David Hall

Reputation: 33153

Here is a basic regex that matches your supplied cases, with one caveat: as Kobi says, your supplied cases are inconsistent, so I've taken the periods out of the first four tests. If you need both behaviours, please add a comment.

This handles all the cases you require, but the rapid proliferation of edge cases makes me think you may want to reconsider the initial problem.

    [TestMethod]
    public void RegexTest()
    {
        Assert.AreEqual("Company Pvt Ltd", RegexMethod("The Company Pvt Ltd"));
        Assert.AreEqual("TheCompany Pvt Ltd", RegexMethod("TheCompany Pvt Ltd"));
        Assert.AreEqual("Company Pvt Ltd", RegexMethod("Company Pvt Ltd. The"));
        Assert.AreEqual("Company Pvt LtdThe", RegexMethod("Company Pvt Ltd.The"));
        Assert.AreEqual("Company234 Pvt Ltd", RegexMethod("Company@234 Pvt; Ltd."));
        // Two new tests for new requirements
        Assert.AreEqual("CompanyThe Ltd", RegexMethod("CompanyThe Ltd."));
        Assert.AreEqual("theasdasdatheapple", RegexMethod("the theasdasdathe the the the ....apple,,,, the"));
        // And the case where you have THETHE at the start
        Assert.AreEqual("CCC", RegexMethod("THETHE CCC"));
    }

    public string RegexMethod(string input)
    {   
        // First attempt, written before the new requirements
        //return Regex.Replace(input, @"The | The|[^A-Z0-9\s]", string.Empty, RegexOptions.IgnoreCase);
        // Second attempt, which anchors the leading "The" to the start of the string
        //return Regex.Replace(input, @"^The | The|[^A-Z0-9\s]", string.Empty, RegexOptions.IgnoreCase);
        // Third attempt, which adds a lookbehind and a lookahead for the last test
        return Regex.Replace(input, @"^(The)+\s|\s(?<![A-Z0-9])[\s]*The[\s]*(?![A-Z0-9])| The$|[^A-Z0-9\s]", string.Empty, RegexOptions.IgnoreCase);
    }

The example also includes a test method that exercises RegexMethod, which contains the regular expression. To use this in your own code, you only need the second method (RegexMethod).
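For the read-time scenario described in the question, a minimal sketch of the wiring might look like the following. The `Company` class and the `ApplyShortName` helper are hypothetical stand-ins for the asker's own types; only `RegexMethod` and its pattern come from the code above.

    using System.Text.RegularExpressions;

    // Hypothetical stand-in for the asker's entity; only the two properties matter here.
    public class Company
    {
        public string CompanyName { get; set; }
        public string ShortName { get; set; }
    }

    public static class CompanyFilter
    {
        // Same pattern as the third RegexMethod variant above.
        public static string RegexMethod(string input)
        {
            return Regex.Replace(input, @"^(The)+\s|\s(?<![A-Z0-9])[\s]*The[\s]*(?![A-Z0-9])| The$|[^A-Z0-9\s]", string.Empty, RegexOptions.IgnoreCase);
        }

        // Assumed usage: filter the value loaded from the database, then upper-case it.
        public static void ApplyShortName(Company company)
        {
            company.ShortName = RegexMethod(company.CompanyName).ToUpper();
        }
    }

Filtering before calling ToUpper keeps the stored CompanyName untouched, which seems to match the constraint that nothing can change at save time.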

Upvotes: 10

Kobi

Reputation: 138037

string company = "Company; PvtThe Ltd.The  . The the.the";
company = Regex.Replace(company, @"\bthe\b", "", RegexOptions.IgnoreCase);
company = Regex.Replace(company, @"[^\w ]", "");
company = Regex.Replace(company, @"\s+", " ");
company = company.Trim();
// company == "Company PvtThe Ltd"

These are the steps. Steps 1 and 2 can be combined, but keeping them separate is clearer.

  1. Remove "the" as a whole word (also works for ".the").
  2. Remove anything that isn't a word character (letter, digit, or underscore) or a space.
  3. Collapse each run of whitespace into a single space.
  4. Trim spaces from both ends.
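The four steps can be wrapped in a single reusable method. This is only a sketch: the class and method names (`CompanyNameCleaner`, `Clean`) are made up for illustration, and step 2 uses an explicit `[^A-Za-z0-9 ]` class instead of `[^\w ]`, on the assumption that underscores and non-ASCII letters should be dropped as well.

    using System.Text.RegularExpressions;

    // Hypothetical helper; the statements mirror steps 1 to 4 above.
    public static class CompanyNameCleaner
    {
        public static string Clean(string company)
        {
            // 1. Remove "the" as a whole word, ignoring case.
            company = Regex.Replace(company, @"\bthe\b", "", RegexOptions.IgnoreCase);
            // 2. Keep only ASCII letters, digits, and spaces (assumption: stricter than \w).
            company = Regex.Replace(company, @"[^A-Za-z0-9 ]", "");
            // 3. Collapse each run of whitespace into a single space.
            company = Regex.Replace(company, @"\s+", " ");
            // 4. Trim spaces from both ends.
            return company.Trim();
        }
    }

Under these assumptions, `CompanyNameCleaner.Clean("Company@234 Pvt; Ltd.")` returns "Company234 Pvt Ltd", and "TheCompany Pvt Ltd." keeps its leading "The" because `\b` does not match inside a word.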

Upvotes: 2
