Snowy

Reputation: 6122

Memory Mapped File to Read End of File?

I have a 6 GB file and the last 20 lines are bad. I would like to use a memory-mapped file with .NET 4 to read the last few lines and display them with Console.WriteLine, and later go to the last 20 lines and replace them with String.Empty. What is a cool way to do that using a memory-mapped file/stream, with a C# example?

Thanks.

Upvotes: 0

Views: 4618

Answers (5)

kam

Reputation: 669

First of all, I will write the code in F# because my C# is rusty, but it should be straightforward to translate into C#.

Second, as I understand it, you need an efficient way to access the content of a file, alter it, and then write it back.

To use a MemoryMappedFile you first need to read the whole file into a temporary mapped file, tmp. This should cause only a little overhead because it is done in one pass. You then use tmp to alter the content, and only after that is done do you write the new file content back. This will probably be faster than using a normal FileStream, and you should not have to worry about stack/heap overflow.

open System.IO
open System.IO.MemoryMappedFiles

// Create a memory-mapped image of the file content, i.e. copy the content,
// and return the MemoryMappedFile.
// `use` is the same as `using` in C#.
let createMappedImage (path : string) =
    let mmf = MemoryMappedFile.CreateNew("tmp", FileInfo(path).Length)
    use source = File.OpenRead(path)
    use view = mmf.CreateViewStream()
    source.CopyTo(view) // stream the bytes across; ReadAllText cannot hold a 6 GB file
    mmf // return the MemoryMappedFile to be used

// Some manipulation function to apply to the image

// type : byte[] -> Stream -> int
let fillBuffer (buffer : byte[]) (reader : Stream) =
    reader.Read(buffer, 0, buffer.Length) // count of bytes read, 0 at end of stream

// type : int -> byte[] -> Stream -> unit
let flushBuffer count (buffer : byte[]) (writer : Stream) =
    writer.Write(buffer, 0, count) // write only the bytes that were read
    // return unit, i.e. void

// Read, then write, the buffer one time.
// writeThrough calls fillBuffer, which returns the count of bytes read,
// and passes it to flushBuffer, which writes them to the destination.
let writeThrough buffer source dest =
    flushBuffer (fillBuffer buffer source) buffer dest
    // return unit

// Write back the altered content of the image without overflow.
// length is the number of bytes to copy: a view over the whole map can be
// rounded up to a page boundary, so it is not read to its end.
let writeBackMappedImage (bufsize : int) (dest : string) (length : int64) (image : MemoryMappedFile) =
    // buffer for read/write
    let buf : byte[] = Array.zeroCreate bufsize // a normal page is 4096 bytes
    // delete old content on write
    use writer = File.Open(dest, FileMode.Truncate, FileAccess.Write)
    use reader = image.CreateViewStream(0L, length)
    while reader.Position < reader.Length do
        writeThrough buf reader writer

let path = "some path"
let length = FileInfo(path).Length
let image = createMappedImage path
let alteredImage = alteration image // some undefined function to correct the content.
writeBackMappedImage 4096 path length alteredImage
image.Dispose()

This hasn't been run, so there are likely to be some errors, but I think the idea is clear. As said above, createMappedImage creates a memory-mapped image of the file.

fillBuffer takes a byte array and a source stream, fills the array, and returns the number of bytes read. flushBuffer takes a count of how much of the buffer should be flushed, the buffer itself, and a destination stream.

Anything you need to do to the file you can do to the image, without unintentionally doing something dangerous to the file. When you are sure that the transformation is correct, you can write the image content back.

Upvotes: 0

Nicholas Carey

Reputation: 74227

I don't know anything about ReverseStreamReaders. The solution is [essentially] simple:

  • Seek to end-of-file
  • Read lines in reverse, counting characters as you go.
  • When you've accumulated 20 lines, you're done: set the file length on the stream by subtracting the number of characters contained in those 20 lines, then close the file.

The devil is in the details, though, regarding that "read lines in reverse" part. There are some complicating factors that are likely to get you in trouble:

  1. You can't seek on a StreamReader, only on a stream.
  2. The last line of the file may or may not be terminated with a CRLF pair.
  3. The .NET Framework's I/O classes don't really differentiate between CR, LF or CRLF as line terminators. They just punted on that convention.
  4. Depending on the encoding used to store the file, reading backwards is very problematic. You don't know what a particular octet/byte represents: it may well be part of a multi-byte encoding sequence. Character != Byte in this modern age. The only way you are safe is if you know either that the file uses a single-byte encoding or, if it is UTF-8, that it contains no characters with a codepoint greater than 0x7F.

I'm not sure there's a good, easy solution outside of the obvious: read sequentially through the file and don't write the last twenty lines.
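A minimal sketch of the seek, scan backwards, and truncate steps listed above, assuming the safe case from point 4: lines end in LF or CRLF, and the encoding is single-byte or ASCII-only UTF-8, so every 0x0A byte is a line feed. TailTruncator and TruncateLastLines are hypothetical names for illustration, not framework APIs.

```csharp
using System;
using System.IO;

public static class TailTruncator
{
    // Hypothetical helper: removes the last `lineCount` lines of the file
    // and returns the new length. Assumes 0x0A always means "line feed".
    public static long TruncateLastLines(string path, int lineCount)
    {
        using (var file = new FileStream(path, FileMode.Open, FileAccess.ReadWrite))
        {
            long length = file.Length;
            if (lineCount <= 0 || length == 0)
                return length;

            // A terminator at the very end closes the last line; skip it
            // so it is not counted as the start of another line.
            long scanEnd = length;
            file.Seek(length - 1, SeekOrigin.Begin);
            if (file.ReadByte() == '\n')
                scanEnd = length - 1;

            var block = new byte[64 * 1024];
            long cut = 0;   // stays 0 if the file has fewer lines than requested
            int found = 0;

            // Walk backwards one block at a time, counting line feeds.
            while (scanEnd > 0 && found < lineCount)
            {
                int count = (int)Math.Min((long)block.Length, scanEnd);
                long start = scanEnd - count;
                file.Seek(start, SeekOrigin.Begin);

                int read = 0;
                while (read < count)
                {
                    int n = file.Read(block, read, count - read);
                    if (n == 0) throw new EndOfStreamException();
                    read += n;
                }

                for (int i = count - 1; i >= 0; i--)
                {
                    if (block[i] == (byte)'\n' && ++found == lineCount)
                    {
                        cut = start + i + 1; // first byte of the lines to drop
                        break;
                    }
                }
                scanEnd = start;
            }

            file.SetLength(cut); // everything from `cut` onwards is discarded
            return cut;
        }
    }
}
```

Because only bytes are compared, the CR of a CRLF pair stays attached to the line it terminates. For other multi-byte encodings the caveats in point 4 still apply.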

Upvotes: 0

tenor

Reputation: 1105

There are two parts to the solution. For the first part, you need to read the memory map backwards to grab lines, until you have read the number of lines you want (20 in this case).

For the second part, you want to truncate the file by the last twenty lines (by setting them to string.Empty). I'm not sure if you can do this with a memory map. You may have to make a copy of the file somewhere and overwrite the original with the source data except the last xxx bytes (which represent the last twenty lines).

The code below will extract the last twenty lines and display them.

You'll also get the position (lastBytePos variable) where the last twenty lines begin. You can use that information to know where to truncate the file.

UPDATE: To truncate the file, call FileStream.SetLength(lastBytePos) once the memory-mapped file has been disposed.

I wasn't sure what you meant by the last 20 lines are bad. In case the disk is physically corrupt and the data cannot be read, I've added a badPositions list that holds the positions where the memorymap had problems reading the data.

I don't have a +2GB file to test with, but it should work (fingers crossed).

using System;
using System.Collections.Generic;
using System.Text;
using System.IO.MemoryMappedFiles;
using System.IO;

namespace ConsoleApplication
{
    class Program
    {
        static void Main(string[] args)
        {
            string filename = "textfile1.txt";
            long fileLen = new FileInfo(filename).Length;
            List<long> badPositions = new List<long>();
            List<byte> currentLine = new List<byte>();
            List<string> lines = new List<string>();
            bool lastReadByteWasLF = false;
            int linesToRead = 20;
            int linesRead = 0;
            long lastBytePos = fileLen;

            MemoryMappedFile mapFile = MemoryMappedFile.CreateFromFile(filename, FileMode.Open);

            using (mapFile)
            {
                // Dispose the view as soon as the loop below finishes
                using (var view = mapFile.CreateViewAccessor())

                for (long i = fileLen - 1; i >= 0; i--) //iterate backwards
                {

                    try
                    {
                        byte b = view.ReadByte(i);
                        lastBytePos = i;

                        switch (b)
                        {
                            case 13: //CR
                                if (lastReadByteWasLF)
                                {
                                    {
                                        //A line has been read
                                        var bArray = currentLine.ToArray();
                                        if (bArray.LongLength > 1)
                                        {
                                            //Add line string to lines collection
                                            lines.Insert(0, Encoding.UTF8.GetString(bArray, 1, bArray.Length - 1));

                                            //Clear current line list
                                            currentLine.Clear();

                                            //Add CRLF to currentLine -- comment this out if you don't want CRLFs in lines
                                            currentLine.Add(13);
                                            currentLine.Add(10);

                                            linesRead++;
                                        }
                                    }
                                }
                                lastReadByteWasLF = false;

                                break;
                            case 10: //LF
                                lastReadByteWasLF = true;
                                currentLine.Insert(0, b);
                                break;
                            default:
                                lastReadByteWasLF = false;
                                currentLine.Insert(0, b);
                                break;
                        }

                        if (linesToRead == linesRead)
                        {
                            break;
                        }


                    }
                    catch
                    {
                        lastReadByteWasLF = false;
                        currentLine.Insert(0, (byte) '?');
                        badPositions.Insert(0, i);
                    }
                }

            }

            if (linesToRead > linesRead)
            {
                //Read last line
                {
                    var bArray = currentLine.ToArray();
                    if (bArray.LongLength > 1)
                    {
                        //Add line string to lines collection
                        lines.Insert(0, Encoding.UTF8.GetString(bArray));
                        linesRead++;
                    }
                }
            }

            //Print results
            lines.ForEach( o => Console.WriteLine(o));
            Console.ReadKey();
        }
    }
}

Upvotes: 0

Simon Mourier

Reputation: 138841

Memory-mapped files can be a problem for big files (typically files as large as or larger than the RAM) if you end up mapping the whole file. If you map only the end, that should not be a real issue.
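To illustrate mapping only the end, here is a minimal sketch that creates a view over just the last few bytes of the file instead of the whole thing. TailReader and ReadTail are hypothetical names, and the text is assumed to be UTF-8. A view that starts in the middle of a multi-byte character will decode that first character incorrectly.

```csharp
using System;
using System.IO;
using System.IO.MemoryMappedFiles;
using System.Text;

public static class TailReader
{
    // Hypothetical helper: returns the last `tailBytes` bytes of the file
    // (or the whole file if it is shorter), decoded as UTF-8.
    public static string ReadTail(string path, int tailBytes)
    {
        long length = new FileInfo(path).Length;
        long size = Math.Min(length, (long)tailBytes);
        if (size <= 0)
            return string.Empty; // also avoids mapping an empty file

        long offset = length - size;

        using (var map = MemoryMappedFile.CreateFromFile(path, FileMode.Open))
        // Only [offset, offset + size) is mapped into the address space.
        using (var view = map.CreateViewAccessor(offset, size, MemoryMappedFileAccess.Read))
        {
            var buffer = new byte[size];
            // Positions passed to the accessor are relative to the view start.
            view.ReadArray(0, buffer, 0, buffer.Length);
            return Encoding.UTF8.GetString(buffer);
        }
    }
}
```

Calling `TailReader.ReadTail(path, 64 * 1024)` on a 6 GB file maps roughly 64 KB rather than the full file. The tail size has to be chosen large enough to contain the lines of interest.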

Anyway, here is a C# implementation that does not use Memory Mapped File, but a regular FileStream. It is based on a ReverseStreamReader implementation (code also included). I would be curious to see it compared to other MMF solutions in terms of performance and memory consumption.

using System;
using System.IO;
using System.Text;

public static void OverwriteEndLines(string filePath, int linesToStrip)
{
    if (filePath == null)
        throw new ArgumentNullException("filePath");

    if (linesToStrip <= 0)
        return;

    using (FileStream file = new FileStream(filePath, FileMode.Open, FileAccess.ReadWrite))
    {
        using (ReverseStreamReader reader = new ReverseStreamReader(file))
        {
            int count = 0;
            do
            {
                string line = reader.ReadLine();
                if (line == null) // end of file
                    break;

                count++;
                if (count == linesToStrip)
                {
                    // write CR LF
                    for (int i = 0; i < linesToStrip; i++)
                    {
                        file.WriteByte((byte)'\r');
                        file.WriteByte((byte)'\n');
                    }

                    // truncate file to current stream position
                    file.SetLength(file.Position);
                    break;
                }
            }
            while (true);
        }
    }
}

// NOTE: we have not implemented all ReadXXX methods
public class ReverseStreamReader : StreamReader
{
    private bool _returnEmptyLine;

    public ReverseStreamReader(Stream stream)
        : base(stream)
    {
        BaseStream.Seek(0, SeekOrigin.End);
    }

    public override int Read()
    {
        if (BaseStream.Position == 0)
            return -1;

        BaseStream.Seek(-1, SeekOrigin.Current);
        int i = BaseStream.ReadByte();
        BaseStream.Seek(-1, SeekOrigin.Current);
        return i;
    }

    public override string ReadLine()
    {
        if (BaseStream.Position == 0)
        {
            if (_returnEmptyLine)
            {
                _returnEmptyLine = false;
                return string.Empty;
            }
            return null;
        }

        int read;
        StringBuilder sb = new StringBuilder();
        while((read = Read()) >= 0)
        {
            if (read == '\n')
            {
                read = Read();
                // supports windows & unix format
                if ((read > 0) && (read != '\r'))
                {
                    BaseStream.Position++;
                }
                else if (BaseStream.Position == 0)
                {
                   // handle the special empty first line case
                    _returnEmptyLine = true;
                }
                break;
            }
            sb.Append((char)read);
        }

        // reverse string. Note this is optional if we don't really need string content
        if (sb.Length > 1)
        {
            char[] array = new char[sb.Length];
            sb.CopyTo(0, array, 0, array.Length);
            Array.Reverse(array);
            return new string(array);
        }
        return sb.ToString();
    }
}

Upvotes: 3

k rey

Reputation: 631

From the question it sounds like you need to have a Memory Mapped file. However, there is a way to do this without using a memory mapped file.

Open the file normally, then move the file pointer to the end of the file. Once you are at the end, read the file in reverse (decrement the file pointer after each read) until you get the desired number of characters.

The cool way: load the characters into an array in reverse as well, so you do not have to reverse them once you are done reading.

Do the fix to the array then write them back. Close, Flush, Complete!

Upvotes: 1
