zumalifeguard

Properly editing a text or source file from C#/.NET

The typical way to edit a text or source file from code is to read the file using File.ReadAllLines or File.ReadAllText, make your changes, and then write it back out using File.WriteAllLines or File.WriteAllText.

However, if you open the same file (say, a source code file) in Visual Studio or Notepad++, scroll down a few lines, make a change, and save, the editor takes care of a lot more than that.

It seems that what is handled, at least on Windows, is a complicated set of rules and heuristics that takes into account, at a minimum:

  1. The inferred encoding of the text file.
  2. The line endings.
  3. Whether the last line is an "incomplete line" (as described in diffutils/manual), namely a line with no line-ending character(s).
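
To make the three points above concrete, here is a minimal sketch of what the typical round trip does to a tiny LF-separated file whose last line is incomplete. The class and method names (NaiveRoundTrip, Rewrite, Demo) are made up for illustration.

```csharp
using System;
using System.IO;

static class NaiveRoundTrip
{
    // Rewrites a file the "typical" way and returns the bytes that end up on disk.
    public static byte[] Rewrite(string path)
    {
        // ReadAllLines strips the line endings, so we no longer know what they were.
        string[] lines = File.ReadAllLines(path);

        // WriteAllLines terminates EVERY line with Environment.NewLine and,
        // with this overload, writes UTF-8 without a BOM.
        File.WriteAllLines(path, lines);

        return File.ReadAllBytes(path);
    }

    // Builds a 3-byte file "a\nb" (LF only, no final newline) and rewrites it.
    public static byte[] Demo()
    {
        string path = Path.GetTempFileName();
        try
        {
            File.WriteAllBytes(path, new byte[] { (byte)'a', (byte)'\n', (byte)'b' });
            return Rewrite(path);
        }
        finally
        {
            File.Delete(path);
        }
    }
}
```

The file goes in as "a\nb" and comes back as "a", Environment.NewLine, "b", Environment.NewLine. On Windows that means the LF became CR/LF and the incomplete last line gained a line ending, even though no line was edited.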

I'll discuss each of these briefly, just to illustrate the complexity. My question is: is there a full set of heuristics, an already established algorithm, or an existing component that encapsulates all of this?

Inferred encoding

Most common for source / text files:

  1. UTF-16 with BOM
  2. UTF-8 with BOM
  3. UTF-8 without BOM

When there's no BOM, the encoding has to be inferred using heuristics. It could be ASCII, Windows-1252 (Encoding.GetEncoding(1252)), or BOM-less UTF-8. Which one it is depends on what the rest of the data looks like: whether it contains known upper-ASCII bytes, or byte sequences that look like valid UTF-8.

When you save, you need to keep the same encoding.
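
A rough sketch of the kind of detection I have in mind, not a complete heuristic. EncodingSniffer and Detect are hypothetical names. The BOM-less fallback here is an assumption: strict UTF-8 if the bytes decode cleanly, otherwise ISO-8859-1 as a stand-in single-byte encoding (Windows-1252 would need the code pages provider registered on .NET Core). UTF-32 BOMs are deliberately ignored.

```csharp
using System;
using System.Text;

static class EncodingSniffer
{
    // Returns the encoding to decode with, plus the number of BOM bytes to skip.
    // The returned encoding's GetPreamble() gives back the same BOM on save.
    public static (Encoding Encoding, int BomLength) Detect(byte[] bytes)
    {
        // UTF-8 BOM: EF BB BF
        if (bytes.Length >= 3 && bytes[0] == 0xEF && bytes[1] == 0xBB && bytes[2] == 0xBF)
            return (new UTF8Encoding(encoderShouldEmitUTF8Identifier: true), 3);

        // UTF-16 little-endian BOM: FF FE
        if (bytes.Length >= 2 && bytes[0] == 0xFF && bytes[1] == 0xFE)
            return (new UnicodeEncoding(bigEndian: false, byteOrderMark: true), 2);

        // UTF-16 big-endian BOM: FE FF
        if (bytes.Length >= 2 && bytes[0] == 0xFE && bytes[1] == 0xFF)
            return (new UnicodeEncoding(bigEndian: true, byteOrderMark: true), 2);

        // No BOM. Pure ASCII is also valid UTF-8, so it lands in the first branch.
        try
        {
            var strict = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false,
                                          throwOnInvalidBytes: true);
            strict.GetString(bytes); // throws if any byte sequence is not valid UTF-8
            return (new UTF8Encoding(encoderShouldEmitUTF8Identifier: false), 0);
        }
        catch (DecoderFallbackException)
        {
            // ASSUMPTION: treat anything else as a single-byte encoding.
            // ISO-8859-1 maps every byte to a character, so bytes survive a round trip.
            return (Encoding.GetEncoding("iso-8859-1"), 0);
        }
    }
}
```

When saving, the idea would be to write encoding.GetPreamble() followed by encoding.GetBytes(text), so that a file that came in with a BOM goes out with one and a file that came in without one stays that way.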

Line endings

You have to keep the same line endings. If the file uses CR/LF, keep it at CR/LF; if it uses just LF, keep that. But it can get more complicated than that, because a given text file may contain both, and that mix has to be maintained as well. For example, a source file that's CR/LF may contain a section whose lines end in LF only. This can happen when someone pastes text from another tool into a multi-line string literal, such as C#'s @"" strings. Visual Studio handles this correctly.
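
Something along these lines is what I picture for keeping mixed endings intact: split the decoded text into lines while remembering the exact ending of each one. LineSplitter and Split are hypothetical names; a lone CR is treated as a line ending too, which is my assumption.

```csharp
using System.Collections.Generic;

static class LineSplitter
{
    // Splits content into (text, ending) pairs.
    // Ending is "\r\n", "\n", "\r", or "" when the last line is incomplete.
    public static List<(string Text, string Ending)> Split(string content)
    {
        var result = new List<(string Text, string Ending)>();
        int start = 0; // index where the current line begins

        for (int i = 0; i < content.Length; i++)
        {
            char c = content[i];
            if (c != '\r' && c != '\n')
                continue;

            // A CR immediately followed by LF counts as one CR/LF ending.
            bool crlf = c == '\r' && i + 1 < content.Length && content[i + 1] == '\n';
            string ending = crlf ? "\r\n" : c.ToString();

            result.Add((content.Substring(start, i - start), ending));

            if (crlf)
                i++; // skip the LF half of the pair
            start = i + 1;
        }

        // Leftover characters form an incomplete last line with no ending.
        if (start < content.Length)
            result.Add((content.Substring(start), ""));

        return result;
    }
}
```

Concatenating Text + Ending for every pair, in order, reproduces the input exactly, so a CR/LF file with an LF-only section stays that way as long as edits only touch Text.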

Incomplete lines

If the last line is incomplete, that has to be maintained as well. That means that if the last line doesn't end with end-of-line character(s), the saved file must not gain them.

Possible approach

I think one way to get around all of these problems from the start is to treat the file as binary instead of text. This means the normal text-file processing in .NET cannot be used. A new set of APIs will be needed to handle editing such files.

I can imagine a component that requires you to open the file as a memory stream and pass that to the component. The component can then read the stream and provide a line-oriented view to clients, so that client code can iterate over the lines for processing. Each element of the iteration would be an object of a type that looks something like this:

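class LineElement
{
    public int originalLineNumber;   // line number in the file as it was read
    public string[] lines;           // text of the line, without its line ending
    public string[] lineEndings;     // line-ending character(s) that go with lines
}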

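As an example, for a normal text file on Windows, each element would start out with a single entry in lines and a single CR/LF entry in lineEndings.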

The lines field can be modified. It can be replaced with an empty array to delete the line, or with a multi-element array to insert lines (replacing the existing line).

The lineEndings array is handled similarly.

In many cases, lines aren't removed or inserted, in which case the application code never has to deal with line endings at all. It simply operates on the lines[] array and ignores the lineEndings[] array.
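
Here is a minimal sketch of how such a component might work on already-decoded text. Everything except the LineElement shape is hypothetical (LineDocument, Parse, Render, the defaultEnding parameter). The rule for inserted lines is my assumption: the last line of an element keeps the element's original ending, and earlier lines without an explicit ending get defaultEnding.

```csharp
using System.Collections.Generic;
using System.Text;

class LineElement
{
    public int originalLineNumber;
    public string[] lines = new string[0];
    public string[] lineEndings = new string[0];
}

static class LineDocument
{
    // One element per original line, each with exactly one text and one ending.
    public static List<LineElement> Parse(string content)
    {
        var elements = new List<LineElement>();
        int start = 0, number = 1; // ASSUMPTION: line numbers are 1-based
        for (int i = 0; i < content.Length; i++)
        {
            char c = content[i];
            if (c != '\r' && c != '\n') continue;
            bool crlf = c == '\r' && i + 1 < content.Length && content[i + 1] == '\n';
            elements.Add(Make(number++, content.Substring(start, i - start),
                              crlf ? "\r\n" : c.ToString()));
            if (crlf) i++;
            start = i + 1;
        }
        if (start < content.Length) // incomplete last line: empty ending
            elements.Add(Make(number, content.Substring(start), ""));
        return elements;
    }

    static LineElement Make(int number, string text, string ending)
    {
        return new LineElement
        {
            originalLineNumber = number,
            lines = new[] { text },
            lineEndings = new[] { ending }
        };
    }

    // Writes the elements back out. An element with an empty lines array emits nothing.
    public static string Render(IEnumerable<LineElement> elements, string defaultEnding)
    {
        var sb = new StringBuilder();
        foreach (var e in elements)
        {
            for (int j = 0; j < e.lines.Length; j++)
            {
                bool last = j == e.lines.Length - 1;
                string ending;
                if (last) // keep the ending the original line had
                    ending = e.lineEndings.Length > 0
                        ? e.lineEndings[e.lineEndings.Length - 1]
                        : defaultEnding;
                else      // explicit ending if the client supplied one
                    ending = j < e.lineEndings.Length - 1 ? e.lineEndings[j] : defaultEnding;
                sb.Append(e.lines[j]).Append(ending);
            }
        }
        return sb.ToString();
    }
}
```

With this, Render(Parse(text), "\r\n") gives back text unchanged. If the client replaces the lines of the second element of "a\r\nb\nc" with { "x", "y" } and leaves lineEndings alone, the result is "a\r\nx\r\ny\nc": the inserted line gets the default ending, the LF after the original line survives, and the last line stays incomplete.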

I'm open to other suggestions.
