pio pio

Reputation: 794

Parsing a flat text file

I am developing an application and I have to upload data from CSV files into DB tables. The problem is, I don’t have CSV files; I have flat text files that need to be converted into CSV first. A further problem is that, since the application is used by several customers with different systems, I have flat text files with different layouts.

What I want to achieve is an application that loads “rules” from a special file; these rules are then applied to the flat text file in order to generate the CSV file. The application that converts from flat file to CSV would always be the same; only the set of rules would differ.

How can I achieve this? What is the best practice you recommend?

Upvotes: 4

Views: 3065

Answers (2)

GolezTrol

Reputation: 116110

It depends on the complexity of the rules. If the only varying input is the names of the columns and the separator used, then it's pretty easy, but if you want to be able to parse completely different formats (like XML or so) as well, then it's a different story.
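For the simple case, a rules file could indeed be very small. As a sketch only (this ini-style format and all key names are made up for illustration, not part of the answer's code):

```ini
[Format]
; Field separator used in the flat file
Separator=;
; 1 if the first line already contains column names
HasHeader=0

[Columns]
; position in the flat file = column name in the CSV output
1=ID
2=Name
3=Telephone
```

Anything beyond renaming columns and changing separators quickly outgrows a declarative file like this, which is where the class hierarchy below comes in.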

I myself would choose to implement a base class for a 'record' reader that reads records from a file and outputs them to a dataset or CSV. Then you can implement child classes that read the different source formats.

If you like, you can then add specific rules for those formats, so you can make a generic XMLReader that descends from BaseReader but allows for configurable column names. I would start, though, with a bunch of hard-coded readers for the formats you have, until it becomes clearer which dialects of those formats you may encounter.

Edit: on request, an example of how it could look.

Note, this example is far from ideal! It reads a custom format, transfers it to one specific table structure and saves that as a CSV file. You may want to split it up a little further so you can reuse the code for different table structures. The field defs, especially, you may want to set in a descendant class or maybe a factory class. But for the sake of simplicity I have taken a more rigid approach and put a little too much intelligence into one single base class.

The base class has the logic needed to create an in-memory dataset (I used a TClientDataSet). It can 'Migrate' a file. In practice, this means it reads, validates and exports the file.

The reading is abstract and must be implemented in a child class; it should read the data into the in-memory dataset. That allows you to do all the necessary validation in the client dataset: you can enforce field types and sizes, and do any additional checking you need, in a database- and file-format-agnostic way.

The validating and writing are done using the data in the dataset. Once the source file has been parsed into a dataset, no knowledge about the source file format is required anymore.

Declaration: don't forget to add DB and DBClient to your uses clause.

type
  TBaseMigrator = class
  private
    FData: TClientDataset;
  protected
    function CSVEscape(Str: string): string;
    procedure ReadFile(AFileName: string); virtual; abstract;
    procedure ValidateData;
    procedure SaveData(AFileName: string);
  public
    constructor Create; virtual;
    destructor Destroy; override;

    procedure MigrateFile(ASourceFileName, ADestFileName: string); virtual;
  end;

Implementation:

{ TBaseMigrator }

constructor TBaseMigrator.Create;
begin
  inherited Create;
  FData := TClientDataSet.Create(nil);
  FData.FieldDefs.Add('ID', ftString, 20, True);
  FData.FieldDefs.Add('Name', ftString, 60, True);
  FData.FieldDefs.Add('Phone', ftString, 15, False);
  // Etc
end;

function TBaseMigrator.CSVEscape(Str: string): string;
begin
  // Escape the string to a CSV-safe format;
  // Todo: Check if this is sufficient!
  Result := '"' + StringReplace(Str, '"', '""', [rfReplaceAll]) + '"';
end;

destructor TBaseMigrator.Destroy;
begin
  FData.Free;
  inherited;
end;

procedure TBaseMigrator.MigrateFile(ASourceFileName, ADestFileName: string);
begin
  // Read the file. Descendant classes need to override this method.
  ReadFile(ASourceFileName);

  // Validation. Implemented in base class.
  ValidateData;

  // Saving/exporting. For now implemented in base class.
  SaveData(ADestFileName);
end;

procedure TBaseMigrator.SaveData(AFileName: string);
var
  Output: TFileStream;
  Writer: TStreamWriter;
  FieldIndex: Integer;
begin
  Output := TFileStream.Create(AFileName,fmCreate);
  Writer := TStreamWriter.Create(Output);
  try

    // Write the CSV headers based on the fields in the dataset
    for FieldIndex := 0 to FData.FieldCount - 1 do
    begin
      if FieldIndex > 0 then
        Writer.Write(',');
      // Column headers are escaped, but this may not be needed, since
      // they likely don't contain quotes, commas or line breaks.
      Writer.Write(CSVEscape(FData.Fields[FieldIndex].FieldName));
    end;
    Writer.WriteLine;

    // Write each row
    FData.First;
    while not FData.Eof do
    begin

      for FieldIndex := 0 to FData.FieldCount - 1 do
      begin
        if FieldIndex > 0 then
          Writer.Write(',');
        // Escape each value
        Writer.Write(CSVEscape(FData.Fields[FieldIndex].AsString));
      end;
      Writer.WriteLine;

      FData.Next
    end;

  finally
    Writer.Free;
    Output.Free;
  end;
end;

procedure TBaseMigrator.ValidateData;
begin
  FData.First;
  while not FData.Eof do
  begin
    // Validate the current row of FData
    FData.Next
  end;
end;

An example child class: TIniFileReader, which reads inifile sections as if they were database records. As you can see, you only need to implement the logic to read the file.

type
  TIniFileReader = class(TBaseMigrator)
  public
    procedure ReadFile(AFileName: string); override;
  end;

{ TIniFileReader }

procedure TIniFileReader.ReadFile(AFileName: string);
var
  Source: TMemIniFile;
  IDs: TStringList;
  ID: string;
  i: Integer;
begin
  // Initialize an in-memory dataset.
  FData.Close; // Be able to migrate multiple files with one instance.
  FData.CreateDataSet;

  // Parsing a weird custom format, where each section in an inifile is a
  // row. Section name is the key, section contains the other fields.
  Source := TMemIniFile.Create(AFileName);
  IDs := TStringList.Create;
  try
    Source.ReadSections(IDs);

    for i := 0 to IDs.Count - 1 do
    begin
      // The section name is the key/ID.
      ID := IDs[i];

      // Append a row.
      FData.Append;

      // Read the values.
      FData['ID'] := ID;
      FData['Name'] := Source.ReadString(ID, 'Name', '');
      // Names don't need to match. The field 'Telephone' in this proprietary
      // format maps to 'Phone' in your CSV output.
      // Later, you can make this customizable (configurable) if you need to,
      // but it's unlikely that you encounter two different inifile-based
      // formats, so it's a waste to implement that until you need it.
      FData['Phone'] := Source.ReadString(ID, 'Telephone', '');

      FData.Post;
    end;

  finally
    IDs.Free;
    Source.Free;
  end;
end;
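Putting it together, usage could look like this (the file names are placeholders):

```pascal
procedure MigrateCustomerFile;
var
  Migrator: TBaseMigrator;
begin
  // The concrete reader is chosen here; everything after this line
  // only talks to the TBaseMigrator interface.
  Migrator := TIniFileReader.Create;
  try
    Migrator.MigrateFile('customers.ini', 'customers.csv');
  finally
    Migrator.Free;
  end;
end;
```

Supporting a new customer format then means writing one new descendant class and changing this single line.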

Upvotes: 7

Warren P

Reputation: 68902

This is very similar to the problems faced by "screen scrapers". If end users are intended to be able to use this, I would avoid regular expressions (except as an internal implementation detail, if needed) and not expose raw regular expression editing to end users.

Instead, I would let them load samples of their data files and construct their rules visually, in a drag-and-drop style.

  1. Click a "Match text" button, then click and drag to select a rectangular region on the screen. Add options so the region is allowed to move a certain amount up, down, left or right, in case the format isn't precise or repeatable, and establish limits on how far it can stray from the original box.

  2. Click a "Grab text" button, then click and drag out a rectangular or non-rectangular (flow) area on the screen. Name the output field and give it a type (integer, string[x], etc.). The same limits as in step 1 apply.

  3. Click save and the template rules are written to disk. Load a different file and see if the rules still apply nicely.
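A saved template rule might be no more than a small record per region. A sketch only; all names here are made up, and the on-disk encoding (ini, JSON, binary) is left open:

```pascal
type
  TRuleKind = (rkMatchText, rkGrabText);

  // One visual rule as it could be stored in the template file.
  TTemplateRule = record
    Kind: TRuleKind;       // step 1 (match) or step 2 (grab)
    Region: TRect;         // the rectangle the user dragged out
    MaxShift: Integer;     // how far the region may drift (columns/lines)
    FieldName: string;     // output field name; only used for rkGrabText
    FieldType: TFieldType; // e.g. ftInteger, ftString
  end;
```

Loading a template is then just reading an array of these records and replaying them against the sample file.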

Relevant Wikipedia topic: screen scraping.

Upvotes: 0
