Reputation: 383
I have a large text file I'm trying to parse. In order to parse this file I need to remove ALL tabs at the start of a string, but not the ones after.
So for example...
string sample = "\t\t\tThis is a string \t with a tab";
sample = RemoveInitialTabs(sample);
// sample should now be "This is a string \t with a tab";
I currently do this by reading the file into an array (delimited by newline characters), iterating through each line, and then for each line adjusting the string until a non-tab character is reached, like so...
for (int i = file_content.Count - 1; i > -1; i--)
{
    // Count the leading tabs on this line...
    int size = 0;
    for (int j = 0; j < file_content[i].Length; j++)
    {
        if (file_content[i][j] != '\t')
        {
            break;
        }
        else
        {
            size = j + 1;
        }
    }
    // ...and cut them off in one go.
    if (size > 0)
    {
        file_content[i] = file_content[i].Remove(0, size);
    }
}
This works, but it is very slow: due to the size of the content in the file, a run typically takes about 66,453 ms ONLY to remove the tabs....
Any ideas?
Upvotes: 2
Views: 243
Reputation: 3065
Try it with regex:
using System.Text.RegularExpressions;

// @"^\t+" matches one or more tabs at the start of the line only,
// so tabs inside the line are left alone.
string pattern = @"^\t+";
for (int i = file_content.Count - 1; i > -1; i--)
{
    file_content[i] = Regex.Replace(file_content[i], pattern, String.Empty);
}
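As an aside (my own suggestion, not something benchmarked here): since the same pattern is applied once per line, a single Regex instance created with RegexOptions.Compiled may shave a little off the loop. The static Regex.Replace does cache parsed patterns, though, so measure before committing to this:

Regex leadingTabs = new Regex(@"^\t+", RegexOptions.Compiled);
for (int i = file_content.Count - 1; i > -1; i--)
{
    file_content[i] = leadingTabs.Replace(file_content[i], String.Empty);
}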
Upvotes: 0
Reputation: 140475
The one place where I think you could save (just a tiny bit): why read in all the strings first, only to then process them and deal with all the copying between two huge arrays?!
What I mean: why don't you remove the leading tabs right when reading in your text files?
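A minimal sketch of that idea, assuming the file is read line by line with a StreamReader (the file name is just a placeholder) and that TrimStart('\t'), which removes only the leading tabs, does what you need:

using System.Collections.Generic;
using System.IO;

var file_content = new List<string>();
using (var reader = new StreamReader("input.txt")) // hypothetical path
{
    string line;
    while ((line = reader.ReadLine()) != null)
    {
        // Strip the leading tabs as each line comes in, before storing it,
        // so no second pass over the array is needed.
        file_content.Add(line.TrimStart('\t'));
    }
}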
On the other hand, your current solution upholds the "separation of responsibilities" principle. And it opens the door for one potential improvement regarding overall runtime: after reading the initial content, you could slice that array and use multiple threads to trim different parts of it in parallel (see the sketch below).
You see, in the end you are talking about a costly operation: changing the start of a string means copying strings around (at least in most languages). And no matter whether you do it with your own code, with regexes, or with TrimStart() ... you won't be able to get below a certain "price tag". But assuming that we are talking about really huge arrays (probably hundreds of thousands of lines), processing the lines in parallel could allow for a significant reduction of the overall runtime.
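A minimal sketch of that parallel idea, as my own illustration: it uses Parallel.For (which partitions the index range for you) rather than hand-rolled threads, and it assumes the lines sit in a string[], since distinct array slots can be written from different threads without locking:

using System.Threading.Tasks;

// Each index is written by exactly one iteration, so no synchronization
// is required beyond what Parallel.For itself provides.
Parallel.For(0, file_content.Length, i =>
{
    file_content[i] = file_content[i].TrimStart('\t');
});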
Upvotes: 2