Tono Nam

Reputation: 36048

Parse StreamReader using regex efficiently

I have the variable

    StreamReader DebugInfo = GetDebugInfo();
    var text = DebugInfo.ReadToEnd();  // takes 10 seconds!!! because there are a lot of students

text equals:

<student>
    <firstName>Antonio</firstName>
    <lastName>Namnum</lastName>
</student>
<student>
    <firstName>Alicia</firstName>
    <lastName>Garcia</lastName>
</student>
<student>
    <firstName>Christina</firstName>
    <lastName>SomeLattName</lastName>
</student>
... etc
.... many more students

What I am doing now is:

  StreamReader DebugInfo = GetDebugInfo();
  var text = DebugInfo.ReadToEnd(); // takes 10 seconds!!!

  var mtch = Regex.Match(text , @"(?s)<student>.+?</student>");
  // keep parsing the file while there are more students
  while (mtch.Success)
  {
     AddStudent(mtch.Value); // parse text node into object and add it to corresponding node
     mtch = mtch.NextMatch();
  } 

The whole process takes about 25 seconds: converting the streamReader to text (var text = DebugInfo.ReadToEnd();) takes 10 seconds, and the parsing takes the other 15 seconds. I was hoping I could do the two parts at the same time...


EDIT

I would like to have something like:

    const int bufferSize = 1024;

    var sb = new StringBuilder();

    Task.Factory.StartNew(() =>
    {
         Char[] buffer = new Char[bufferSize];
         int count = bufferSize;

         using (StreamReader sr = GetUnparsedDebugInfo())
         {

             while (count > 0)
             {
                 count = sr.Read(buffer, 0, bufferSize);
                 sb.Append(buffer, 0, count);
             }
         }

         var m = sb.ToString();
     });

     Thread.Sleep(100);

     // meanwhile, as the string is being built, start adding items

     var mtch = Regex.Match(sb.ToString(), @"(?s)<student>.+?</student>"); 

     // keep parsing the file while there are more nodes
     while (mtch.Success)
     {
         AddStudent(mtch.Value);
         mtch = mtch.NextMatch();
     }  

Edit 2

Summary

Sorry, I forgot to mention: the text is very similar to XML, but it is not XML. That's why I have to use regular expressions... In short, I think I could save time because what I am doing now is converting the stream to a string and then parsing the string. Why not just parse the stream with a regex? Or, if that is not possible, why not take a chunk of the stream and parse that chunk on a separate thread?
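One way to sketch the "parse chunks on a separate thread" idea is a producer/consumer pair over a BlockingCollection: one task reads 2 KB buffers from the stream while the main thread runs the regex over whatever has arrived, carrying the unmatched tail forward so a student node split across two chunks is still matched once it completes. This is only a sketch; GetDebugInfo and AddStudent are assumed to be the methods from the question.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.IO;
using System.Text.RegularExpressions;
using System.Threading.Tasks;

static class StreamingRegex
{
    static readonly Regex StudentPattern = new Regex(@"(?s)<student>.+?</student>");

    // Consume chunks as they arrive; keep the unmatched tail so a <student>
    // node split across two chunks is still matched once it is complete.
    public static IEnumerable<string> Students(IEnumerable<string> chunks)
    {
        string carry = "";
        foreach (var chunk in chunks)
        {
            carry += chunk;
            int lastEnd = 0;
            foreach (Match m in StudentPattern.Matches(carry))
            {
                yield return m.Value;
                lastEnd = m.Index + m.Length;
            }
            carry = carry.Substring(lastEnd); // partial node waits for the next chunk
        }
    }

    // Producer/consumer wiring: a task fills the queue with 2 KB reads,
    // the calling thread parses them as they become available.
    public static void Run(TextReader reader, Action<string> addStudent)
    {
        var queue = new BlockingCollection<string>();
        var producer = Task.Run(() =>
        {
            var buffer = new char[2048];
            int count;
            while ((count = reader.Read(buffer, 0, buffer.Length)) > 0)
                queue.Add(new string(buffer, 0, count));
            queue.CompleteAdding(); // tells the consumer no more chunks are coming
        });
        foreach (var student in Students(queue.GetConsumingEnumerable()))
            addStudent(student);
        producer.Wait();
    }
}
```

With that in place the call site would be something like StreamingRegex.Run(GetDebugInfo(), AddStudent); no Thread.Sleep and no shared index are needed, because BlockingCollection does the signaling.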

Upvotes: 4

Views: 6932

Answers (5)

kakridge

Reputation: 2293

UPDATED:

This basic code reads a (roughly) 20 megabyte file in .75 seconds. At that rate my machine would process roughly 53.33 megabytes in the 2 seconds that you reference. Further, 20,000,000 / 2,048 = 9,765.625 reads, and .75 / 9,765.625 ≈ .0000768, which means you are reading 2,048 characters roughly every 77 microseconds. You need to weigh the cost of context switching against the timing of your iterations to decide whether the added complexity of multithreading is worthwhile. At 7.68×10^-5 seconds per read, I see your reader thread sitting idle most of the time. It doesn't make sense to me. Just use a single loop with a single thread.

    char[] buffer = new char[2048];
    StreamReader sr = new StreamReader(@"C:\20meg.bin");
    while (sr.Read(buffer, 0, 2048) != 0)
    {
        ; // do nothing
    }

For large operations like this, you want to use a forward-only, non-cached reader. It looks like your data is XML, so an XmlTextReader is perfect for this. Here is some sample code. Hope this helps.

    string firstName;
    string lastName;
    using (XmlTextReader reader = GetDebugInfo())
    {
        while (reader.Read())
        {
            if (reader.IsStartElement() && reader.Name == "student")
            {
                reader.ReadToDescendant("firstName");
                reader.Read();
                firstName = reader.Value;
                reader.ReadToFollowing("lastName");
                reader.Read();
                lastName = reader.Value;
                AddStudent(firstName, lastName);
            }
        }
    }

I used the following XML:

<students>
    <student>
        <firstName>Antonio</firstName>
        <lastName>Namnum</lastName>
    </student>
    <student>
        <firstName>Alicia</firstName>
        <lastName>Garcia</lastName>
    </student>
    <student>
        <firstName>Christina</firstName>
        <lastName>SomeLattName</lastName>
    </student>
</students>

You may need to tweak. This should run much, much faster.

Upvotes: 2

Tono Nam

Reputation: 36048

@kakridge was right. I could be dealing with a race condition where one task is writing listToProcess[30], for example, while another thread is parsing listToProcess[30]. To fix that problem, and also to remove the inefficient Thread.Sleep calls, I ended up using semaphores. Here is my new code:

        StreamReader unparsedDebugInfo = GetUnparsedDebugInfo(); // get streamReader 
        listToProcess = new char[200000][];
        lastPart = null;
        matchLength = 0;

        // Used to signal events between thread that is reading text 
        // from readelf.exe and the thread that is parsing chunks
        Semaphore semaphore = new Semaphore(0, 1);

        // If task1 run out of chunks to process it will be waiting for semaphore to post a message
        bool task1IsWaiting = false;

        // Used to note that there are no more chunks to add to listToProcess.
        bool mainTaskIsDone = false;

        int counter = 0; // keep track of which chunk we have added to the list

        // This task will be executed on a separate thread. Meanwhile the other thread adds nodes to  
        // "listToProcess" array this task will add those chunks to the dictionary. 
        var task1 = Task.Factory.StartNew(() =>
        {
            semaphore.WaitOne(); // wait until there are at least 1024 nodes to be processed

            int indexOnList = 0; // counter to identify the index of chunk[] we are adding to dictionary

            while (true)
            {
                if (indexOnList>=counter)   // if equal it might be dangerous! 
                {                           // chunk could be being written to and at the same time being parsed.
                    if (mainTaskIsDone)// if the main task is done executing stop
                        break;

                    task1IsWaiting = true; // otherwise wait until there are more chunks to be processed
                    semaphore.WaitOne();
                }

                ProcessChunk(listToProcess[indexOnList]); // add chunk to dictionary
                indexOnList++;
            }
        });


        // this block being executed on main thread  is responsible for placing the streamreader 
        // into chunks of char[] so that task1 can start processing those chunks
        {                
            int waitCounter = 1024; // every time task1 is waiting we use this counter to place at least 256 new chunks before continue to parse them

            while (true) // more chunks on listToProcess before task1 continues executing
            {
                char[] buffer = new char[2048]; // buffer where we will place data read from stream

                var charsRead = unparsedDebugInfo.Read(buffer, 0, buffer.Length);

                if (charsRead < 1){
                    listToProcess[counter] = pattern;
                    break;
                }

                listToProcess[counter] = buffer;
                counter++; // add chunk to list to be processed by task1.

                if (task1IsWaiting)
                {               // if task1 is waiting for more nodes process 256
                    waitCounter = counter + 256;    // more nodes then continue execution of task2
                    task1IsWaiting = false;
                }
                else if (counter == waitCounter)                    
                    semaphore.Release();                    
            }
        }

        mainTaskIsDone = true; // let other thread know that this task is done

        semaphore.Release(); // release all threads that might be waiting on this thread

        task1.Wait(); // wait for all nodes to finish processing

Upvotes: 0

Tono Nam

Reputation: 36048

Here is what turned out to be the fastest (though maybe I missed some things to try).

I created an array of arrays, char[][] listToProcess = new char[200000][], where I place chunks of the stream. On a separate task I start processing each chunk. The code looks like:

   StreamReader sr = GetUnparsedDebugInfo(); // get streamReader                        

   var task1 = Task.Factory.StartNew(() =>
   {
       Thread.Sleep(500); // wait a little so there are items on list (listToProcess) to work with
       StartProcesingList();
   });

   int counter = 0;

   while (true)
   {
       char[] buffer = new char[2048]; // create a new buffer each time; we will add it to the list to process

       var charsRead = sr.Read(buffer, 0, buffer.Length);

       if (charsRead < 1) // if we reach the end then stop
       {
           break;
       }

       listToProcess[counter] = buffer;
       counter++;
   }

   task1.Wait();

and the method StartProcesingList() basically starts going through the list until it reaches a null object.

    void StartProcesingList()
    {
        int indexOnList = 0;

        while (true)
        {
            if (listToProcess[indexOnList] == null)
            {
                Thread.Sleep(100); // wait a little in case other thread is adding more items to the list

                if (listToProcess[indexOnList] == null)
                    break;
            }

            // add chunk to dictionary if you recall listToProcess[indexOnList] is a 
        // char array so it basically converts that to a string and splits it where appropriate
            // there is more logic as in the case where the last chunk will have to be 
            // together with the first chunk of the next item on the list
            ProcessChunk(listToProcess[indexOnList]);

            indexOnList++;                
        }

    }

Upvotes: 1

Ivan Nikitin

Reputation: 4133

RegEx isn't the fastest way to parse a string. You need a tailored parser similar to XmlReader (to match your data structure). It will allow you to read the file partially and parse it much faster than RegEx does.

Since you have a limited set of tags and nesting, an FSM approach (http://en.wikipedia.org/wiki/Finite-state_machine) will work for you.
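A minimal sketch of such a state machine, assuming the tag and field names from the question's sample (firstName, lastName, the closing /student): it walks the stream one character at a time, toggling between "inside a tag name" and "inside a value", so it never materializes the whole text and uses no regex.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

// Hand-rolled scanner: one pass over the stream. Works for the XML-like
// format in the question even if it is not valid XML, as long as tags are
// delimited by '<' and '>'.
static class StudentScanner
{
    public static IEnumerable<(string First, string Last)> Scan(TextReader reader)
    {
        var sb = new StringBuilder();
        string lastTag = null, first = null, last = null;
        int c;
        while ((c = reader.Read()) != -1)
        {
            char ch = (char)c;
            if (ch == '<')
            {
                // the buffer currently holds the text that preceded this tag
                if (lastTag == "firstName") first = sb.ToString();
                else if (lastTag == "lastName") last = sb.ToString();
                sb.Clear();
            }
            else if (ch == '>')
            {
                lastTag = sb.ToString(); // buffer held the tag name
                sb.Clear();
                if (lastTag == "/student")
                    yield return (first, last);
            }
            else
            {
                sb.Append(ch);
            }
        }
    }
}
```

Usage would be something like foreach (var s in StudentScanner.Scan(GetDebugInfo())) AddStudent(s.First, s.Last); where GetDebugInfo and AddStudent are the question's own methods.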

Upvotes: 1

Alexei Levenkov

Reputation: 100527

You can read line-by-line, but if reading the data itself takes 15 seconds there is not much you can do to speed things up.

Before making any significant changes, try simply reading all lines of the file with no processing. If that alone takes longer than your goal, adjust the goal or change the file format. Otherwise, see how much you can expect to gain from optimizing the parsing - RegEx is quite fast for uncomplicated regular expressions.
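A read-only baseline could look like the following; the file name is a stand-in for whatever feeds GetDebugInfo. If this loop alone already blows the time budget, no parsing optimization will help.

```csharp
using System;
using System.Diagnostics;
using System.IO;

static class ReadOnlyBaseline
{
    // Reads every line and does nothing else, so the elapsed time
    // measures pure I/O plus decoding.
    public static long CountLines(TextReader reader)
    {
        long lines = 0;
        while (reader.ReadLine() != null)
            lines++;
        return lines;
    }

    static void Main()
    {
        var sw = Stopwatch.StartNew();
        long lines;
        using (var sr = new StreamReader("debug.txt")) // stand-in path
            lines = CountLines(sr);
        sw.Stop();
        Console.WriteLine($"read {lines} lines in {sw.ElapsedMilliseconds} ms");
    }
}
```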

Upvotes: 1
