Stevetech

Reputation: 115

Parsing text files using Pentaho

I have a large number of text files that I need to load into a database. Instead of being in the more usual CSV format, they are structured like this:

TY  - JOUR
T1  - On the Structure and Life-History of Entyloma ranunculi (Bonorden)
JF  - Philosophical Transactions of the Royal Society of London. B (1887-1895)
VL  - 178
SP  - 173
EP  - 185
PY  - 1887/01/01/
UR  - http://dx.doi.org/10.1098/rstb.1887.0008
M3  - doi:10.1098/rstb.1887.0008
AU  - Ward, H.
ER  -

Each line is a different field, with the field name denoted by the leading characters.

What I would like to do is load each of the lines into a corresponding field in the record. I would like to do this via Pentaho; does anyone have any idea how to accomplish this? The text input step is set up for CSV input.

Upvotes: 1

Views: 12016

Answers (2)

shawty

Reputation: 5829

That text sample looks awfully familiar.....

scurries off to check something

If that text sample is what I think it is, "a royal scientific society journal description file", then you're not going to be able to write a Pentaho script to parse these things.

I've been there, tried it and it was painful, very painful.

Why?

Well, a number of things.

First off, the file format is NOT strictly checked, so you WILL FIND some files whose lines don't follow the expected layout of a 2-character ID, two spaces, a dash, a space and then the data.

You'll also find some files that have unparsed LaTeX commands and/or unprocessed variable substitutions in them.

In short, the files (at least the ones I looked at the last time I did anything with these) were in a horrible state.

The other main issue you will have is missing lines.

Every descriptor is supposed to have 11 primary tags as follows:

TY
T1
JF
VL
SP
EP
PY
UR
M3
AU
ER

which from memory (and checking against the sample above) are:

TY - Record type
T1 - Title
JF - Journal name
VL - Volume number
SP - Start page
EP - End page
PY - Published year
UR - URL
M3 - DOI
AU - Author name
ER - End of record

Frequently you'll find that not all of these lines are present, but to make the columns line up in a CSV you still need to add blank entries for the missing ones.

Also beware of the AU field: it can, and very often does, appear multiple times in a single file, so you'll often get:

TY  - ....
T1  - ....
....
AU  - ....
AU  - ....
....

Processing this using the Pentaho method in Carey's answer is going to throw a lot of your rows out of sync, because that approach expects one row per tag per file.

On the subject of Carey's answer, I have to say it's a very good one, and it came way closer to doing a good conversion than I ever managed before I gave up, but the cold hard fact is that the files are just not in a suitable state to be reliably processed by Pentaho.

To that end, I've dragged out some C# I wrote to take a folder full of these text files and turn them into a flattened CSV.

The resulting CSV is not perfect and will still require a small amount of tweaking, but it will get you 99.9% of the way there, and it will be far easier to process with Pentaho than the source files themselves.

The code is fairly generic C#, so it should compile both on Windows and on Mono (although I have to admit I've not tested it on the latter).

Here's the code:

using System.Collections.Generic;
using System.IO;
using System.Text.RegularExpressions;

namespace SciDataParse
{
  class RecordData
  {
    public string TY { get; set; }
    public string T1 { get; set; }
    public string JF { get; set; }
    public string VL { get; set; }
    public string SP { get; set; }
    public string EP { get; set; }
    public string PY { get; set; }
    public string UR { get; set; }
    public string M3 { get; set; }
    public List<string> AU { get; set; }
    public string ER { get; set; }

    public RecordData()
    {
      AU = new List<string>();
      TY = string.Empty;
      T1 = string.Empty;
      JF = string.Empty;
      VL = string.Empty;
      SP = string.Empty;
      EP = string.Empty;
      PY = string.Empty;
      UR = string.Empty;
      M3 = string.Empty;
      ER = string.Empty;
    }
  }

  class Program
  {
    static RecordData ProcessFile(string inputName)
    {
      RecordData result = new RecordData();

      using (StreamReader reader = new StreamReader(inputName))
      {
        string inputLine = reader.ReadLine();
        while (inputLine != null) // read to the end of the file; blank or malformed lines are skipped below
        {
          if (!Regex.IsMatch(inputLine, @"^[A-Z0-9][A-Z0-9]\s+-(\s+.*)?$"))
          {
            inputLine = reader.ReadLine();
            continue; // Regex match to ensure lines are in the expected "XX  - data" format
          }
          string[] lineItems = inputLine.Split(new[] { '-' }, 2); // split on the FIRST dash only; titles, dates and URLs may contain dashes
          string tag = lineItems[0].Trim();
          string data = lineItems.Length > 1 ? lineItems[1].Trim() : string.Empty;
          switch (tag)
          {
            // Sort and add lines to our result object.  Note we check and change null to empty strings and filter commas
            // so that we don't create any problems with outputting CSV data
            case "TY" :
              result.TY = !string.IsNullOrEmpty(data) ? data : string.Empty;
              break;

            case "T1":
              result.T1 = !string.IsNullOrEmpty(data) ? data.Replace(",", string.Empty) : string.Empty;
              break;

            case "JF":
              result.JF = !string.IsNullOrEmpty(data) ? data.Replace(",", string.Empty) : string.Empty;
              break;

            case "VL":
              result.VL = !string.IsNullOrEmpty(data) ? data : string.Empty;
              break;

            case "SP":
              result.SP = !string.IsNullOrEmpty(data) ? data : string.Empty;
              break;

            case "EP":
              result.EP = !string.IsNullOrEmpty(data) ? data : string.Empty;
              break;

            case "PY":
              result.PY = !string.IsNullOrEmpty(data) ? data : string.Empty;
              break;

            case "UR":
              result.UR = !string.IsNullOrEmpty(data) ? data : string.Empty;
              break;

            case "M3":
              result.M3 = !string.IsNullOrEmpty(data) ? data : string.Empty;
              break;

            case "AU":
              // AU = Author items of which there can be multiple, note we also replace blank author names with "Unknown"
              result.AU.Add(!string.IsNullOrEmpty(data) ? data.Replace(",", string.Empty) : "Unknown");
              break;

            case "ER":
              result.ER = !string.IsNullOrEmpty(data) ? data : string.Empty;
              break;
          }
          inputLine = reader.ReadLine();
        }
      }

      return result;
    }

    static void Main()
    {
      List<RecordData> fileRecords = new List<RecordData>();
      List<string> headerColumns = new List<string> {"TY", "T1", "JF", "VL", "SP", "EP", "PY", "UR", "M3", "AU", "ER"};

      string baseFolder = Directory.GetCurrentDirectory();

      string[] fileNames = Directory.GetFiles(baseFolder, "*.txt");

      foreach (string fileName in fileNames)
      {
        fileRecords.Add(ProcessFile(fileName));
      }

      using (StreamWriter writer = new StreamWriter("consolidated_data.csv"))
      {
        string headerRow = string.Join(",", headerColumns);
        writer.WriteLine(headerRow);

        foreach (RecordData fileRecord in fileRecords)
        {
          string fileLine = string.Empty;
          fileLine += fileRecord.TY + ",";
          fileLine += fileRecord.T1 + ",";
          fileLine += fileRecord.JF + ",";
          fileLine += fileRecord.VL + ",";
          fileLine += fileRecord.SP + ",";
          fileLine += fileRecord.EP + ",";
          fileLine += fileRecord.PY + ",";
          fileLine += fileRecord.UR + ",";
          fileLine += fileRecord.M3 + ",";
          fileLine += string.Join("|",fileRecord.AU) + ","; // Join author names with a |
          fileLine += fileRecord.ER;
          writer.WriteLine(fileLine);
        }
      }

    }
  }
}

Compile the file, then copy the resulting EXE into the folder where all your txt files are and run it.

Now before any of you C# purists start jumping in here and picking fault with my code, please note this...

A) It was written some time ago as a quick tool to solve a problem I had; it was never meant to be production code.

B) Yes, I know there are better ways to do things, such as string builders vs concatenation. See point A.

C) If you do come in and start picking faults in my code instead of trying to help the OP with his issue (like I have), then you're just a douchebag with nothing better to do.

D) Any faults you attempt to point out in my code won't result in me losing any sleep because, quite frankly, I don't care. It works, I know it works (because it did what I needed it to do at the time), and that's all I'm bothered about.

Back to the OP's question.

As I say, it's not perfect; you will have to do a minimal amount of editing.

Any editing you do have to do, you will want to do in a plain text editor. If memory serves, some of the description lines are horribly long and will exceed the maximum column width available in Excel, which results in a cell full of '###################'.

You can of course just import the CSV directly into your database using Pentaho and then edit the records in there to tidy the data up (that's how I did it).

One last thing to note: the author names (at least where there are multiple ones) are joined with a | character into one CSV field under the "AU" tag, so when you come to process them further, you might want to consider putting them in their own table, with their own ID and a foreign key pointing back to the source record.
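
If it helps, here's a rough sketch of how you could split that pipe-joined AU column back out into one author per row for a child table. This is separate from the tool above and purely illustrative: the input file name matches the tool's output, but the "authors.csv" name and the record_id column are just assumptions for the example, and it relies on the tool having already stripped commas out of the free-text fields.

using System.IO;
using System.Linq;

class AuthorSplit
{
  static void Main()
  {
    // Read the consolidated CSV produced by the tool above, skipping the header row.
    // Assumes no field contains embedded commas (the tool strips them from the free-text fields).
    string[] rows = File.ReadAllLines("consolidated_data.csv").Skip(1).ToArray();

    using (StreamWriter writer = new StreamWriter("authors.csv"))
    {
      writer.WriteLine("record_id,author");
      for (int recordId = 0; recordId < rows.Length; recordId++)
      {
        // Column layout is TY,T1,JF,VL,SP,EP,PY,UR,M3,AU,ER so AU is at index 9
        string[] columns = rows[recordId].Split(',');
        foreach (string author in columns[9].Split('|'))
        {
          // One row per author, keyed back to the source record by its row number
          writer.WriteLine(recordId + "," + author);
        }
      }
    }
  }
}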

If you can't compile it (or don't want to), drop me a line, tell me your platform, and I'll build it for you and send you the binary.

Upvotes: 5

Carey

Reputation: 31

You need to use the Row Denormaliser step to denormalise the rows.

Steps:

  1. Use your text file input step to read each line of the data into one field

  2. Use a field splitter step to split that field on the " - " separator

  3. Sort the data on a group field (I did not identify a group ID in your sample). If no group ID is available, then hopefully you have a fixed number of rows per group, in which case you can add a calculated group ID (see the sketch after these steps)

  4. Pass the rows to the Row Denormaliser and specify the following:

    4.1. Add your group ID field to the group ID spec.

    4.2. Add a target field name for each line you need turned into a field. I added 11 in my sample,
    for TY, T1, JF, etc. They can be any names you choose.

    4.3. For each new field, set the value field name to the field you assigned to the second part
    of the split. In my sample the field splitter produced two fields, fld_hdr and fld_content, and my value field is fld_content.

    4.4. Specify the field type and optionally the remaining columns on each line.
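
Purely as an illustration of the calculated group ID idea in step 3 (sketched in C# only because that's what's already in this thread, not as a Pentaho step): since each record in the OP's sample starts with a TY line, a counter that increments on every TY line gives you a group ID to denormalise on, without relying on a fixed number of rows per group. The input file name below is just a placeholder.

using System;
using System.IO;

class GroupIdSketch
{
  static void Main()
  {
    int groupId = 0;

    // "records.txt" is a placeholder for whatever file holds the tagged lines.
    foreach (string line in File.ReadLines("records.txt"))
    {
      // Each record starts with a TY line (and ends with ER), so start a new group there.
      if (line.StartsWith("TY"))
      {
        groupId++;
      }

      // Emit the group ID alongside the original line; in Pentaho this value would
      // become the group field that the Row Denormaliser groups on.
      Console.WriteLine(groupId + "\t" + line);
    }
  }
}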

I created a sample but don't see where to upload the ktr file.

Upvotes: 3
