alexandrekow

Reputation: 1937

Replace the start of line in a file quickly

I have an initial file containing lines such as:

34    964:0.049759 1123:0.0031 2507:0.015979 
32,48 524:0.061167 833:0.030133 1123:0.002549
34,52 534:0.07349 698:0.141667 1123:0.004403 
106   389:0.013396 417:0.016276 534:0.023859

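The first part of a line is the class number. A line can have several classes, separated by commas.

For each class, I create a new file in which the class list is replaced by +1 if the line belongs to that class and by -1 otherwise.

For instance, for class 34 the resulting file will be: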

+1 964:0.049759 1123:0.0031 2507:0.015979 
-1 524:0.061167 833:0.030133 1123:0.002549
+1 534:0.07349 698:0.141667 1123:0.004403 
-1 389:0.013396 417:0.016276 534:0.023859

For class 106 the resulting file will be:

-1 964:0.049759 1123:0.0031 2507:0.015979 
-1 524:0.061167 833:0.030133 1123:0.002549
-1 534:0.07349 698:0.141667 1123:0.004403 
+1 389:0.013396 417:0.016276 534:0.023859

The problem is that I have 13 input files to process for 200 classes, so 2600 files to write. A less optimized version of my code took several hours. With the code below it takes 1 hour to generate the 2600 files.

Is there a faster way to perform such a replacement? Is regex a viable option?

Below is my implementation (it runs in LINQPad with this data file):

static void Main()
{
    const string filePath = @"C:\data.txt";
    const string generatedFilesFolderPath = @"C:\";
    const string fileName = "data";

    using (new TimeIt("Whole process"))
    {
        var fileLines = File.ReadLines(filePath).Select(l => l.Split(new[] { ' ' }, 2)).ToList();
        var classValues = GetClassValues();
        foreach (var classValue in classValues)
        {
            var directoryPath = Path.Combine(generatedFilesFolderPath, classValue);

            if (!Directory.Exists(directoryPath))
                Directory.CreateDirectory(directoryPath);

            var classFilePath = Path.Combine(directoryPath, fileName);

            using (var file = new StreamWriter(classFilePath))
            {
                foreach (var line in fileLines)
                {
                    var lineFirstPart = line.First();
                    string newFirstPart = "-1";

                    var hashset = new HashSet<string>(lineFirstPart.Split(','));
                    if (hashset.Contains(classValue))
                    {
                        newFirstPart = "+1";
                    }

                    file.WriteLine("{0} {1}", newFirstPart, line.Last());
                }
            }
        }
    }

    Console.Read();
}

public static List<string> GetClassValues()
{
    // In real life there are 200 class values.
    return Enumerable.Range(0, 2).Select(c => c.ToString()).ToList(); 
}

public class TimeIt : IDisposable
{
    private readonly string _name;
    private readonly Stopwatch _watch;
    public TimeIt(string name)
    {
        _name = name;
        _watch = Stopwatch.StartNew();
    }
    public void Dispose()
    {
        _watch.Stop();
        Console.WriteLine("{0} took {1}", _name, _watch.Elapsed);
    }
}

The output:

Whole process took 00:00:00.1175102

EDIT: I also ran a profiler and it looks like the split method is the hottest spot.


EDIT 2: Simple example:

2,1 1:0.8 2:0.2
3   1:0.4 3:0.6
12  1:0.02 4:0.88 5:0.1

Expected output for class 2:

+1 1:0.8 2:0.2
-1 1:0.4 3:0.6
-1 1:0.02 4:0.88 5:0.1

Expected output for class 3:

-1 1:0.8 2:0.2
+1 1:0.4 3:0.6
-1 1:0.02 4:0.88 5:0.1

Expected output for class 4:

-1 1:0.8 2:0.2
-1 1:0.4 3:0.6
-1 1:0.02 4:0.88 5:0.1

Upvotes: 4

Views: 143

Answers (2)

rene

Reputation: 42453

I have eliminated the hottest paths from your code by removing the Split calls and using a bigger buffer on the FileStream.

Instead of Split, I now call ToCharArray on each raw line and scan the characters up to the first space, matching them against classValue one character at a time as I go. The boolean found is set when a comma-delimited token before the first space matches classValue exactly. The +1 or -1 label is then written over the characters just in front of that space. The rest of the handling is the same.

var fsw = new FileStream(classFilePath,
    FileMode.Create,
    FileAccess.Write,
    FileShare.None,
    64*1024*1024); // use a large buffer
using (var file = new StreamWriter(fsw)) // use the filestream
{
    foreach(var line in fileLines) // for( int i = 0;i < fileLines.Length;i++)
    {
        char[] chars = line.ToCharArray();
        int matched = 0;
        int parsePos = -1;
        bool takeClass = true;
        bool found = false;
        bool space = false;
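        // parse until the first space (assumes every line contains one);
        // checking parsePos + 1 keeps the index below from running past the array
        while (parsePos + 1 < chars.Length && !space)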
        {
            parsePos++;
            space = chars[parsePos] == ' '; // end
            // tokens
            if (chars[parsePos] == ' ' ||
                chars[parsePos] == ',')
            {
                if (takeClass 
                    && matched == classValue.Length)
                {
                    found = true;
                    takeClass = false;
                }
                else
                {
                    // reset matching
                    takeClass = true;
                    matched = 0;
                }
            }
            else
            {
                if (takeClass 
                    &&  matched < classValue.Length 
                    && chars[parsePos] == classValue[matched])
                {
                    matched++; // on the next iteration, match next
                }
                else
                {
                    takeClass = false; // no match!
                }    
            }
        }

        chars[parsePos - 1] = '1'; // replace 1 in front of space
        var correction = 1;
        if (parsePos > 1)
        {
            // is classValue before the comma (or before space)
            if (found)
            {
                chars[parsePos - 2] = '+';
            }
            else
            {
                chars[parsePos - 2] = '-';
            }
            correction++;
        }
        else
        {
            // is classValue before the comma (or before space)
            if (found)
            {
                // not enough space in the array, write a single char
                file.Write('+');
            }
            else
            {
                file.Write('-');
            }
        }
        file.WriteLine(chars, parsePos - correction, chars.Length - (parsePos - correction));
    }
}
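
The matching part of the loop above can be pulled out into a small helper, which makes it easier to check on its own. This is only a sketch of the same idea, not the code that was timed: the names ClassMatcher and ContainsClass are made up for illustration, and it assumes the class list is the comma-separated text before the first space. It compares in place and allocates no strings.

```csharp
using System;

public static class ClassMatcher
{
    // Returns true when classValue appears as a whole comma-separated token
    // in the part of the line before the first space.
    // Hypothetical helper: it works on the string directly and allocates nothing.
    public static bool ContainsClass(string line, string classValue)
    {
        int end = line.IndexOf(' ');
        if (end < 0) end = line.Length; // no space: treat the whole line as the class list

        int tokenStart = 0;
        for (int i = 0; i <= end; i++)
        {
            // A token ends at a comma or at the end of the class list.
            if (i == end || line[i] == ',')
            {
                int tokenLength = i - tokenStart;
                if (tokenLength == classValue.Length &&
                    string.CompareOrdinal(line, tokenStart, classValue, 0, tokenLength) == 0)
                {
                    return true;
                }
                tokenStart = i + 1;
            }
        }
        return false;
    }
}
```

With this, the label for a line is ClassMatcher.ContainsClass(line, classValue) ? "+1" : "-1". Comparing whole tokens is what keeps class 3 from matching a line tagged 34.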

Upvotes: 1

Mike Hixson

Reputation: 5189

Instead of iterating over the unparsed lines 200 times, how about parsing the lines up front into a data structure and then iterating over that 200 times? This should minimize the number of string manipulation operations.

I also read the file with a StreamReader and yield each parsed line as I go, so the entire file is not held in memory twice: once as raw strings and again as Detail[].

static void Main(string[] args)
{
    var details = ReadDetail("data.txt").ToArray();
    var classValues = Enumerable.Range(0, 10).ToArray();

    foreach (var classValue in classValues)
    {
        // Create file/directory etc

        using (var file = new StreamWriter("out.txt"))
        {
            foreach (var detail in details)
            {
                file.WriteLine("{0} {1}", detail.Classes.Contains(classValue) ? "+1" : "-1", detail.Line);
            }
        }
    }
}

static IEnumerable<Detail> ReadDetail(string filePath)
{
    using (StreamReader reader = new StreamReader(filePath))
    {
        while (!reader.EndOfStream)
        {
            string line = reader.ReadLine();
            int separator = line.IndexOf(' ');

            Detail detail = new Detail
            {
                Classes = line.Substring(0, separator).Split(',').Select(c => Int32.Parse(c)).ToArray(),
                Line = line.Substring(separator + 1)
            };

            yield return detail;
        }
    }
}

public class Detail
{
    public int[] Classes { get; set; }
    public string Line { get; set; }
}
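
To check the parse-once idea against the question's EDIT 2 sample without touching the disk, the relabeling step can be written as a function over already parsed lines. This is a rough sketch under a few assumptions: ParsedLine, Relabeler, Parse and Relabel are names invented here, the classes are stored in a HashSet<int> rather than an int[] so the lookup does not scan an array, and the padding spaces after the class list are trimmed so the output matches the expected files.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical container for one input line, parsed once up front.
public sealed class ParsedLine
{
    public HashSet<int> Classes;
    public string Features;
}

public static class Relabeler
{
    // Splits "<classes> <features>" at the first space.
    public static ParsedLine Parse(string line)
    {
        int separator = line.IndexOf(' ');
        if (separator < 0)
            throw new FormatException("Expected '<classes> <features>': " + line);

        var classes = new HashSet<int>(
            line.Substring(0, separator).Split(',').Select(c => int.Parse(c)));

        return new ParsedLine
        {
            Classes = classes,
            // The sample data pads short class lists with extra spaces; drop them.
            Features = line.Substring(separator + 1).TrimStart(' ')
        };
    }

    // Produces the output lines for one class from the parsed data.
    public static IEnumerable<string> Relabel(IEnumerable<ParsedLine> lines, int classValue)
    {
        foreach (var parsed in lines)
            yield return (parsed.Classes.Contains(classValue) ? "+1 " : "-1 ") + parsed.Features;
    }
}
```

Parse each input line once, keep the results in a list, then pass each line of Relabel(parsedLines, classValue) to file.WriteLine inside the per-class loop.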

Upvotes: 1
