I have written a C# console application that reads emails and extracts data from them.
With some help, I got it to the point where it can read the columns in pairs, but as soon as it reaches the table at the bottom of the email (which may contain more than these two lines), it fails to break the rows down.
This is what I have tried:
using System;
using System.Text.RegularExpressions;
using System.Collections.Generic;
namespace Multiline_Email_Test
{
    /// <summary>
    /// Console app to test the reading of the multiline email.
    /// If successful readback is shown we could import to SQL Server.
    /// </summary>
    public class Program
    {
        public static void Main()
        {
            string email = @"NOTIFICATION OF MOVEMENT STARTING IN AUGUST
Consignor Package ID Local Reference Number
------------------- ----------------------
GRLK123450012 123456
Place Of dispatch Guarantor type code
----------------- -------------------
GR00001234567 1
Consignee Package ID Guarantor details
----------------- -------------------
RR001239E0070
Place Of delivery Date of dispatch DD MM YYYY
----------------- ---------------------------
FR001379E0570 21 03 2019
Time of dispatch
----------------
08:29
Vehicle registration number
---------------------------
XXBB12345678
Item number Package Product CN CodeCode Quantity Brand
----------- ------------------------- -------- -----
Line 1 of 2 B000 22040009 7603.200 Guinness DIC 440ml CAN 06X04 MDCES
Line 2 of 2 B000 22040009 14636.160 Guinness DIC 440ml CAN 06X04 MDCES
";
            var dict = new Dictionary<string, string>();
            try
            {
                var lines = email.Split(Environment.NewLine.ToCharArray(), StringSplitOptions.RemoveEmptyEntries);
                int starts = 0, end = 0, length = 0;
                while (!lines[starts + 1].StartsWith("-"))
                    starts++;
                for (int i = starts + 1; i < lines.Length; i += 3)
                {
                    var mc = Regex.Matches(lines[i], @"(?:^| )-");
                    foreach (Match m in mc)
                    {
                        int start = m.Value.StartsWith(" ") ? m.Index + 1 : m.Index;
                        end = start;
                        while (lines[i][end++] == '-' && end < lines[i].Length)
                            ;
                        length = Math.Min(end - start, lines[i - 1].Length - start);
                        string key = length > 0 ? lines[i - 1].Substring(start, length).Trim() : "";
                        end = start;
                        while (lines[i][end++] == '-' && end < lines[i].Length)
                            ;
                        length = Math.Min(end - start, lines[i + 1].Length - start);
                        string value = length > 0 ? lines[i + 1].Substring(start, length).Trim() : "";
                        dict.Add(key, value);
                    }
                }
            }
            catch (Exception ex)
            {
                throw new Exception(ex.ToString());
            }
            foreach (var x in dict)
                Console.WriteLine("{0} : {1}", x.Key, x.Value);
        }
    }
}
I have created a live demo on .NET Fiddle: https://dotnetfiddle.net/6nMO2c
Upvotes: 1
Views: 95
Reputation: 161
Your code for the header values of the document seems to work, but just for fun, here is a regex that does the same job. After that, I address the table data.
// Needs: using System.Collections.Generic; using System.Linq; using System.Text.RegularExpressions;
// inputText is the e-mail body (the `email` string in the question).
int textArrayPosition = 0; // Offset where the header part ends and the table part begins
var headersDictionary = new Dictionary<string, string>();
List<string> arrayHeaders;
List<List<string>> arrayData = new List<List<string>>();
// Matches three consecutive lines: two header names, a dashed rule, then one or two values.
var headersFinder = new Regex(@"^(.*?) {2,}(.*)\r\n\-*? {2,}\-*\r\n(.*?)( {2,}(.*)|$)", RegexOptions.Multiline);
foreach (Match match in headersFinder.Matches(inputText))
{
    var firstHeaderName = match.Groups[1].Value.Trim();
    var secondHeaderName = match.Groups[2].Value.Trim();
    // Trim() also drops the '\r' that '.' can capture in front of '\n'.
    if (!string.IsNullOrWhiteSpace(firstHeaderName))
        headersDictionary.Add(firstHeaderName, match.Groups[3].Value.Trim());
    // Groups.Count is fixed by the pattern, so there is nothing to test on it;
    // Groups[5].Value is simply an empty string when the second value is missing.
    if (!string.IsNullOrWhiteSpace(secondHeaderName))
        headersDictionary.Add(secondHeaderName, match.Groups[5].Value.Trim());
    textArrayPosition = match.Index + match.Length;
}
Console.WriteLine("*** Document headers :");
foreach (var entry in headersDictionary)
    Console.WriteLine($"{entry.Key} = {entry.Value}");
Next, take the text that follows the last header match and split it into lines; this is the table.
var arrayLines = inputText.Substring(textArrayPosition).Split(new string[] { "\n", "\r" }, StringSplitOptions.RemoveEmptyEntries);
Now for the table itself. Its header row cannot be used to separate the columns, so I rely on runs of at least two consecutive spaces in the first data line to guess the column positions. A simple regex does that.
if (arrayLines.Length > 2)
{
    var arrayColsPositions = new List<int>();
    // Find cols positions
    arrayColsPositions.Add(0);
    var firstDataLine = arrayLines[2];
    var columnsPositionDetector = new Regex(@" {2,}", RegexOptions.Singleline);
    foreach (Match match in columnsPositionDetector.Matches(firstDataLine))
    {
        arrayColsPositions.Add(match.Index + match.Length);
    }
    // Find headers
    arrayHeaders = ReadLineValues(arrayLines[0], arrayColsPositions).ToList();
    // Find data lines
    for (int lineId = 2; lineId < arrayLines.Length; lineId++)
    {
        arrayData.Add(ReadLineValues(arrayLines[lineId], arrayColsPositions).ToList());
    }
    Console.WriteLine("\n*** Array headers :");
    Console.WriteLine(string.Join(", ", arrayHeaders));
    Console.WriteLine("\n*** Array lines data :");
    foreach (var record in arrayData)
    {
        Console.WriteLine(string.Join(", ", record));
    }
}
else
    Console.WriteLine("The array is empty.");
Finally, here is the small helper method I wrote to pull each value from the right position without reading past the end of shorter lines.
private static IEnumerable<string> ReadLineValues(string sourceLine, List<int> colsPositions)
{
    for (int colId = 0; colId < colsPositions.Count; colId++)
    {
        var start = colsPositions[colId];
        int length;
        if (colId < colsPositions.Count - 1)
            length = colsPositions[colId + 1] - start;
        else
            length = sourceLine.Length - start;
        if (start < sourceLine.Length)
        {
            if (start + length > sourceLine.Length)
                length = sourceLine.Length - start;
            yield return sourceLine.Substring(start, length).Trim();
        }
    }
}
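As a usage illustration, here is a self-contained sketch that combines the column detection and the slicing helper on made-up rows. The class name ColumnSliceDemo, the DetectColumns helper, and the sample rows are assumptions for this example and are not part of the code above. The rows are built with PadRight so that the column offsets are explicit.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class ColumnSliceDemo
{
    // Hypothetical helper: column starts are 0 plus the end of every run of 2+ spaces.
    public static List<int> DetectColumns(string firstDataLine)
    {
        var positions = new List<int> { 0 };
        foreach (Match m in Regex.Matches(firstDataLine, " {2,}"))
            positions.Add(m.Index + m.Length);
        return positions;
    }

    // Same slicing idea as ReadLineValues: cut the line at the given start offsets,
    // never reading past the end of a short line.
    public static IEnumerable<string> ReadLineValues(string sourceLine, List<int> colsPositions)
    {
        for (int colId = 0; colId < colsPositions.Count; colId++)
        {
            int start = colsPositions[colId];
            if (start >= sourceLine.Length)
                yield break; // the line ends before this column starts
            int end = colId < colsPositions.Count - 1
                ? Math.Min(colsPositions[colId + 1], sourceLine.Length)
                : sourceLine.Length;
            yield return sourceLine.Substring(start, end - start).Trim();
        }
    }

    public static void Main()
    {
        // Made-up sample rows with columns starting at offsets 0, 13 and 23.
        string header = "Item".PadRight(13) + "Code".PadRight(10) + "Qty";
        string row1 = "Line 1 of 2".PadRight(13) + "B000".PadRight(10) + "7603.200";
        string row2 = "Line 2 of 2".PadRight(13) + "B000"; // short line, last column missing

        var cols = DetectColumns(row1);
        Console.WriteLine(string.Join(" | ", ReadLineValues(header, cols))); // Item | Code | Qty
        Console.WriteLine(string.Join(" | ", ReadLineValues(row1, cols)));   // Line 1 of 2 | B000 | 7603.200
        Console.WriteLine(string.Join(" | ", ReadLineValues(row2, cols)));   // Line 2 of 2 | B000
    }
}
```

Note that a single space inside a value ("Line 1 of 2") does not split the column, because only runs of two or more spaces count as separators. The detection depends on the first data line having a value in every column.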