LukeHennerley

Reputation: 6444

Using Regex to process file contents instead of going line by line

I am trying to use C# to split a SQL script that contains regions using Regex.Split(), but I can't work out the pattern. I really struggle with regular expressions and find them bewildering in most circumstances, although I understand them to be the best way to achieve the following.

Input string (the real input is roughly 100,000 times the size of the sample below, hence the sluggishness of my method):

--#region someregioncomment
aaaa
bbbb
--#endregion 

Each line break is \r\n.

Output Dictionary<string, string>

At the moment I am doing this:

Dictionary<string, string> regionValues = new Dictionary<string, string>();
using (StringReader sr = new StringReader(SSBS))
{
  string strCurrentRegion = "";
  string strCurrentRegionContents = "";
  while (sr.Peek() != -1)
  {
    string strCurrentLine = sr.ReadLine();
    if (strCurrentLine.Contains("--#region"))
    {
      strCurrentRegion = strCurrentLine;
    }
    if (string.IsNullOrEmpty(strCurrentRegion))
    {
      continue;
    }
    else if (strCurrentLine.Contains("--#endregion"))
    {
      regionValues.Add(strCurrentRegion, strCurrentRegionContents);
      strCurrentRegion = "";
    }
    else
    {
      strCurrentRegionContents += ("\r\n" + strCurrentLine);
    }
  }
}

However, I felt that this could be achieved with a Regex pattern combined with Regex.Split(). I can't seem to get the gist of what the pattern should look like...

I have attempted:

(--#region.*?)\n
(--#region)\w*

I just can't seem to get it! Any help for my desired output appreciated :)

Thanks.

Upvotes: 0

Views: 482

Answers (1)

Cédric Bignon

Reputation: 13022

The problem with String.Split and Regex is that they need the whole file loaded into memory. So why not read the script line by line with a StreamReader?

using System.Collections.Generic;
using System.IO;
using System.Text;

Dictionary<string, string> regions = new Dictionary<string, string>();

string regionName = null;                          // null means "not inside a region"
StringBuilder regionString = new StringBuilder();
using (StreamReader streamReader = File.OpenText("MyFile.txt"))
{
    while (!streamReader.EndOfStream)
    {
        string line = streamReader.ReadLine();

        if (line.StartsWith("--#region "))         // Beginning of the region
        {
            regionName = line.Substring(10);       // Skip the 10 characters of "--#region "
        }
        else if (line.StartsWith("--#endregion"))  // End of the region
        {
            if (regionName == null)
                throw new InvalidDataException("#endregion found without a #region.");
            regions.Add(regionName, regionString.ToString());
            regionString.Clear();
            regionName = null;                     // Lines between regions must not leak into the next one
        }
        else if (regionName != null) // If the line is in a region
        {
            regionString.AppendLine(line);
        }
    }
}

Be careful with the Dictionary: if your file contains multiple regions with the same name, Add will throw an ArgumentException.
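If duplicate region names can occur, one option is to merge their contents instead of calling Add directly. The following is only a sketch: the RegionStore class and the AddOrAppend name are hypothetical, and concatenating the contents is an assumption about what you want (you might prefer to keep the first, keep the last, or fail).

```csharp
using System.Collections.Generic;

static class RegionStore
{
    // Hypothetical helper: appends to an existing entry instead of
    // throwing when the region name is already present.
    public static void AddOrAppend(Dictionary<string, string> regions, string name, string contents)
    {
        if (regions.TryGetValue(name, out string existing))
        {
            regions[name] = existing + contents;   // Assumption: duplicates are merged
        }
        else
        {
            regions.Add(name, contents);
        }
    }
}
```

In the loop, this would be called in place of regions.Add(...) when an --#endregion line is reached.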

A few pieces of advice:

  • Use StringBuilder instead of concatenating the string (for better performance).
  • Use String.StartsWith instead of String.Contains, for two reasons: performance (StartsWith is cheaper to check) and correctness (imagine what happens if a string literal in your SQL contains "--#region").
  • To create a new line, don't use "\r\n", which is environment-specific; use Environment.NewLine instead.
  • sr.Peek() shouldn't be used to test for the end of the file/stream. StreamReader has a property designed for this: StreamReader.EndOfStream. (StringReader has no such property; there, loop until ReadLine() returns null.)
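If the script is already in memory as a string, as in the question, the same advice can be applied with a StringReader. This is a minimal sketch rather than a definitive implementation: the RegionParser class and Parse method are hypothetical names, and it assumes region names are unique and that markers start at the beginning of a line.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

static class RegionParser
{
    // Hypothetical helper: maps each region name to the lines between
    // "--#region <name>" and "--#endregion".
    public static Dictionary<string, string> Parse(string script)
    {
        const string RegionPrefix = "--#region ";
        var regions = new Dictionary<string, string>();
        var contents = new StringBuilder();
        string regionName = null;

        using (var reader = new StringReader(script))
        {
            string line;
            // ReadLine returns null at the end of the input, so Peek() is not needed.
            while ((line = reader.ReadLine()) != null)
            {
                if (line.StartsWith(RegionPrefix, StringComparison.Ordinal))
                {
                    regionName = line.Substring(RegionPrefix.Length);
                }
                else if (line.StartsWith("--#endregion", StringComparison.Ordinal))
                {
                    if (regionName == null)
                        throw new InvalidDataException("#endregion found without a #region.");
                    regions.Add(regionName, contents.ToString()); // Throws on a duplicate name
                    contents.Clear();
                    regionName = null;
                }
                else if (regionName != null)
                {
                    contents.AppendLine(line);     // Uses Environment.NewLine
                }
            }
        }
        return regions;
    }
}
```

Usage would look like `var regionValues = RegionParser.Parse(SSBS);`, where SSBS is the string from the question.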

Upvotes: 2
