C# Multi line and multi column email reader extractor

Question

I have made a console application in C# that reads emails and extracts the data from them.

With some help I have got it to a stage where it can read the columns in pairs, but as soon as it hits the table at the bottom of the email (there could be more lines than these two) it fails to break it down.

This is what I have tried:

using System;
using System.Text.RegularExpressions;
using System.Collections.Generic;

namespace Multiline_Email_Test
{
/// <summary>
/// Console app to test the reading of the multiline email.
/// If successful readback is shown we could import to SQL Server.
/// </summary>
public class Program
{
    public static void Main()
    {
        string email = @"NOTIFICATION OF MOVEMENT STARTING IN AUGUST

Consignor Package ID                              Local Reference Number
-------------------                              ----------------------
GRLK123450012                                         123456

Place Of dispatch                                Guarantor type code
-----------------                                -------------------
GR00001234567                                          1

Consignee Package ID                              Guarantor details
-----------------                                -------------------
RR001239E0070

Place Of delivery                                Date of dispatch DD MM YYYY
-----------------                                ---------------------------
FR001379E0570                                    21 03 2019

                                                 Time of dispatch
                                                 ----------------
                                                 08:29

                                                Vehicle registration number
                                               ---------------------------
                                               XXBB12345678

Item number   Package Product CN CodeCode    Quantity       Brand
-----------   -------------------------     --------       -----
Line 1 of 2   B000           22040009       7603.200       Guinness DIC    440ml CAN 06X04 MDCES
Line 2 of 2   B000           22040009       14636.160      Guinness DIC    440ml CAN 06X04 MDCES

";



        var dict = new Dictionary<string, string>();
        try
        {
            var lines = email.Split(Environment.NewLine.ToCharArray(), StringSplitOptions.RemoveEmptyEntries);
            int starts = 0, end = 0, length = 0;
            while (!lines[starts + 1].StartsWith("-"))
                starts++;
            for (int i = starts + 1; i < lines.Length; i += 3)
            {
                var mc = Regex.Matches(lines[i], @"(?:^| )-");
                foreach (Match m in mc)
                {
                    int start = m.Value.StartsWith(" ") ? m.Index + 1 : m.Index;
                    end = start;
                    while (lines[i][end++] == '-' && end < lines[i].Length)
                        ;
                    length = Math.Min(end - start, lines[i - 1].Length - start);
                    string key = length > 0 ? lines[i - 1].Substring(start, length).Trim() : "";
                    end = start;
                    while (lines[i][end++] == '-' && end < lines[i].Length)
                        ;
                    length = Math.Min(end - start, lines[i + 1].Length - start);
                    string value = length > 0 ? lines[i + 1].Substring(start, length).Trim() : "";
                    dict.Add(key, value);
                }
            }
        }
        catch (Exception ex)
        {
            throw new Exception(ex.ToString());
        }

        foreach (var x in dict)
            Console.WriteLine("{0} : {1}", x.Key, x.Value);
       }
   }
}

I have created a live demo on .NET Fiddle here: https://dotnetfiddle.net/6nMO2c

Dmo · Accepted Answer

Your code seems to work for the header values of the document, but just for fun I found a regex that does the same job. Below I also answer the question about the table data.

        int textArrayPosition = 0; // Just to separate the header part and the table part
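        var headersDictionary = new Dictionary<string, string>();
        List<string> arrayHeaders;
        List<List<string>> arrayData = new List<List<string>>();
        // The pattern spans three lines (names, dashes, values) and expects \r\n line endings.
        var headersFinder = new Regex(@"^(.*?) {2,}(.*)\r\n\-*? {2,}\-*\r\n(.*?)( {2,}(.*)|$)", RegexOptions.Multiline);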

        foreach (Match match in headersFinder.Matches(inputText))
        {
            if (match.Groups.Count < 4)
                continue;

            var firstHeaderName = match.Groups[1].Value;
            var secondHeaderName = match.Groups[2].Value;

            if (!string.IsNullOrWhiteSpace(firstHeaderName))
                headersDictionary.Add(firstHeaderName, match.Groups[3].Value);

            if (!string.IsNullOrWhiteSpace(secondHeaderName))
            {
                if (match.Groups.Count == 6)
                    headersDictionary.Add(secondHeaderName, match.Groups[5].Value);
                else
                    headersDictionary.Add(secondHeaderName, string.Empty);
            }

            textArrayPosition = match.Index + match.Length;
        }

        Console.WriteLine("*** Document headers :");
        foreach (var entry in headersDictionary)
            Console.WriteLine($"{entry.Key} = {entry.Value}");
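
One detail worth knowing about the group checks in the loop above: in .NET, Match.Groups.Count is determined by the pattern, not by which groups took part in a given match. An optional group that did not match is still present, with Success set to false and an empty Value. A minimal sketch with a made-up pattern and input (neither comes from the answer):

```csharp
using System;
using System.Text.RegularExpressions;

public static class GroupCountSketch
{
    public static void Main()
    {
        // Hypothetical pattern: group 2 is optional and cannot match in "abc".
        Match m = Regex.Match("abc", @"(a)(x)?");

        // Whole match plus the two groups declared in the pattern.
        Console.WriteLine(m.Groups.Count);                    // 3
        // The optional group did not participate in the match.
        Console.WriteLine(m.Groups[2].Success);               // False
        Console.WriteLine(m.Groups[2].Value == string.Empty); // True
    }
}
```

The header regex declares five capturing groups, so Groups.Count is 6 for every match and Groups[5].Value is simply empty when a block has no second value. Testing Groups[5].Success would express that intent more directly.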

Then we take the text after the last header block and split it into lines to get the table.

        var arrayLines = inputText.Substring(textArrayPosition).Split(new string[] { "\r\n", "\n" }, StringSplitOptions.RemoveEmptyEntries);

Now for the table. Its header line does not separate the columns reliably, so I rely on finding runs of at least two consecutive spaces in the first line of data to guess where each column starts. A simple regex does that.
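
As a small standalone illustration of that idea, here is a sketch that returns the guessed column start positions for one data line. The class and method names are hypothetical and the sample line is a shortened version of the first table row:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ColumnDetectionSketch
{
    // Returns 0 plus the index just past every run of two or more spaces.
    public static List<int> FindColumnStarts(string dataLine)
    {
        var starts = new List<int> { 0 };
        foreach (Match gap in Regex.Matches(dataLine, @" {2,}"))
            starts.Add(gap.Index + gap.Length);
        return starts;
    }

    public static void Main()
    {
        // Single spaces inside "Line 1 of 2" are not treated as separators.
        string sample = "Line 1 of 2   B000   22040009";
        Console.WriteLine(string.Join(", ", FindColumnStarts(sample))); // 0, 14, 21
    }
}
```

Note the limitation of this heuristic: a value that itself contains two or more consecutive spaces, such as the brand text in the sample email, is split into more than one column.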

        if (arrayLines.Length > 2)
        {
            var arrayColsPositions = new List<int>();

            // Find cols positions
            arrayColsPositions.Add(0);
            var firstDataLine = arrayLines[2];
            var columnsPositionDetector = new Regex(@" {2,}", RegexOptions.Singleline);
            foreach (Match match in columnsPositionDetector.Matches(firstDataLine))
            {
                arrayColsPositions.Add(match.Index + match.Length);
            }

            // Find headers (ToList() requires "using System.Linq;")
            arrayHeaders = ReadLineValues(arrayLines[0], arrayColsPositions).ToList();
            // Find data lines
            for (int lineId = 2; lineId < arrayLines.Length; lineId++)
            {
                arrayData.Add(ReadLineValues(arrayLines[lineId], arrayColsPositions).ToList());
            }

            Console.WriteLine("\n*** Array headers :");
            Console.WriteLine(string.Join(", ", arrayHeaders));

            Console.WriteLine("\n*** Array lines data :");
            foreach (var record in arrayData)
            {
                Console.WriteLine(string.Join(", ", record));
            }
        }
        else
            Console.WriteLine("The array is empty.");

Finally, here is the small utility method I wrote to read each value from the right position without running past the end of shorter lines.

    private static IEnumerable<string> ReadLineValues(string sourceLine, List<int> colsPositions)
    {
        for (int colId = 0; colId < colsPositions.Count; colId++)
        {
            var start = colsPositions[colId];
            int length;
            if (colId < colsPositions.Count - 1)
                length = colsPositions[colId + 1] - start;
            else
                length = sourceLine.Length - start;

            if (start < sourceLine.Length)
            {
                if (start + length > sourceLine.Length)
                    length = sourceLine.Length - start;

                yield return sourceLine.Substring(start, length).Trim();
            }
        }
    }
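
To see how the method behaves on a line that is shorter than the column layout, here is a self-contained sketch. It repeats the method (made public so it can be called from outside the class) and feeds it hypothetical column positions and two shortened sample lines:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ReadLineValuesSketch
{
    // Same logic as the utility method above, with the generic types spelled out.
    public static IEnumerable<string> ReadLineValues(string sourceLine, List<int> colsPositions)
    {
        for (int colId = 0; colId < colsPositions.Count; colId++)
        {
            var start = colsPositions[colId];
            int length;
            if (colId < colsPositions.Count - 1)
                length = colsPositions[colId + 1] - start;
            else
                length = sourceLine.Length - start;

            // Columns that start past the end of the line are skipped.
            if (start < sourceLine.Length)
            {
                // Clamp the last readable column to the end of the line.
                if (start + length > sourceLine.Length)
                    length = sourceLine.Length - start;

                yield return sourceLine.Substring(start, length).Trim();
            }
        }
    }

    public static void Main()
    {
        // Assumed positions for three columns in the shortened samples below.
        var positions = new List<int> { 0, 14, 21 };

        var full = ReadLineValues("Line 1 of 2   B000   22040009", positions).ToList();
        var truncated = ReadLineValues("Line 2 of 2   B000", positions).ToList();

        Console.WriteLine(string.Join("|", full));      // Line 1 of 2|B000|22040009
        Console.WriteLine(string.Join("|", truncated)); // Line 2 of 2|B000
    }
}
```

The truncated line yields only two values instead of throwing, which is why the table code can handle rows with missing trailing columns.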
