Brandon

Reputation: 3266

Parsing Line Breaks from Plain Text

I have a process that parses emails. The software we use to retrieve and store the message body doesn't seem to preserve line breaks, so I end up with something like this:

Good afternoon, [line-break] this is my email. [line-break] Info: data [line-break] More info: data

The [line-break] markers show where the line breaks should be. However, when we extract the body, we get just the text, which makes it hard to parse.

Essentially, what I need to do is parse each [Info]: [Data] pair. I can find where the [Info] tags begin, but without line breaks I'm struggling to know where the data associated with that info ends. The email is coming from Windows.

Is there any way to take plain text and encode it in some way that would include line breaks?

Example Email Contents

Good Morning, Order: 1234 The Total: $445 When: 7/10 Type: Dry

Good Morning, Order: 1235 The Total: $1743 Type: Frozen When: 7/22

Order: 1236 The Total: $950.14 Type: DRY When: 7/10

The Total: $514 Order: 1237 Type: Dry CSR: Tim W

Sorry, below is your order: Order: 1236 The Total: $500 When: 7/10 Type: Dry Creator: Josh A. Thank you

Now I need to loop through each email and parse out the values for Order, The Total, and Type. The other "placeholder: value" pairs are irrelevant and vary from email to email.

Upvotes: 2

Views: 607

Answers (1)

Jimi

Reputation: 32248

Try something like this.
You need to add all possible section identifiers to the pattern. The list can be extended over time with more known identifiers, which reduces the chance of mistakes when parsing the strings.

As of now, if the value marked by a known identifier is followed by an unknown identifier, the unknown identifier and its content are removed from that value.
An unknown identifier is otherwise ignored.

Regex.Matches extracts all matching parts and returns the Value, Index, and Length of each, so it's simple to use [Input].Substring(Index, NextPosition - Index) to return the value corresponding to the requested part.
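
The Index/Length slicing described above can be sketched in isolation. This is a minimal illustration, not the EmailParser class itself: IdentifierSlicer and Slice are hypothetical names, and the pattern is assumed to be a plain alternation of identifiers such as "Order:|The Total:|Type:".

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper used only for illustration.
public static class IdentifierSlicer
{
    // Returns each matched identifier with the text that follows it,
    // up to the next matched identifier or the end of the input.
    public static List<KeyValuePair<string, string>> Slice(string input, string pattern)
    {
        var parts = new List<KeyValuePair<string, string>>();
        MatchCollection matches = Regex.Matches(input, pattern);

        for (int i = 0; i < matches.Count; i++)
        {
            Match current = matches[i];

            // The value starts right after the identifier...
            int start = current.Index + current.Length;
            // ...and ends where the next identifier begins, or at the end of the input.
            int end = (i + 1 < matches.Count) ? matches[i + 1].Index : input.Length;

            string value = input.Substring(start, end - start).Trim();
            parts.Add(new KeyValuePair<string, string>(current.Value, value));
        }
        return parts;
    }
}
```

Calling IdentifierSlicer.Slice(input, "Order:|The Total:|Type:") on one of the sample emails yields one key/value pair per known identifier, in the order they appear. Any text between two known identifiers, including unknown identifiers, ends up in the preceding value, which is why a clean-up step is still needed.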

The EmailParser class's GetPartValue(string) method returns the content of an identifier by its name (the name can include the colon char or not, e.g. "Order" or "Order:").
The Matches property returns a Dictionary<string, string> of all matched identifiers and their content. The content is cleaned up, as far as possible, by the CleanUpValue() method.

Adjust this method to handle specific or future requirements.

► If you don't pass a pattern string, a default one is used.
► If you change the pattern by setting the CurrentPatter property (for example, to one stored in the app settings or edited in a GUI), the Dictionary of matched values is rebuilt.
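
Since the pattern is just an alternation of identifiers, it could also be generated from a list of names, for instance one loaded from settings. The following is a sketch under that assumption: PatternBuilder and Build are hypothetical names, and the helper is not part of the EmailParser class.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper used only for illustration.
public static class PatternBuilder
{
    // Builds an alternation such as "Order:|Type:" from identifier names.
    // Names may be given with or without the trailing colon.
    public static string Build(IEnumerable<string> identifiers)
    {
        return string.Join("|", identifiers
            .Select(id => id.Trim().TrimEnd(':'))   // normalize "Order:" and " Order " to "Order"
            .Where(id => id.Length > 0)             // skip empty entries
            .Distinct()                             // avoid duplicate alternatives
            .Select(id => Regex.Escape(id) + ":")); // escape regex metacharacters in names
    }
}
```

The result can be passed to the EmailParser constructor or assigned to CurrentPatter. Regex.Escape keeps an identifier containing characters such as "." or "$" from being interpreted as regex syntax.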

Initialize with:

string input = "Good Morning,  Order: 1234 The Total: $445 Unknown: some value Type: Dry When: 7/10";
var parser = new EmailParser(input);
string value = parser.GetPartValue("The Total");
var values = parser.Matches;

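using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public class EmailParser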
{
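    // Default identifiers. The active pattern is kept per instance so that a
    // custom pattern passed to one parser does not leak into other parsers.
    private const string DefaultPattern = "Order:|The Total:|Type:|Creator:|When:|CSR:";
    private string m_Pattern = DefaultPattern;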

    public EmailParser(string email) : this(email, null) { }
    public EmailParser(string email, string pattern)
    {
        if (!string.IsNullOrEmpty(pattern)) {
            m_Pattern = pattern;
        }
        Email = email;
        this.Matches = GetMatches();
    }

    public string Email { get; }

    public Dictionary<string, string> Matches { get; private set; }

    public string CurrentPatter {
        get => m_Pattern;
        set {
            if (value != m_Pattern) {
                m_Pattern = value;
                this.Matches = GetMatches();
            }
        }
    }

    public string GetPartValue(string part)
    {
        if (string.IsNullOrEmpty(part)) {
            throw new ArgumentException("Part name is required", nameof(part));
        }
        // Accept both "Order" and "Order:".
        if (part[part.Length - 1] != ':') part += ':';
        if (!Matches.TryGetValue(part, out string value)) {
            throw new ArgumentException("Part not included");
        }
        return value;
    }

    private Dictionary<string, string> GetMatches()
    {
        var dict = new Dictionary<string, string>();
        var matches = Regex.Matches(Email, m_Pattern, RegexOptions.Singleline);

        foreach (Match m in matches) {
            int startPosition = m.Index + m.Length;
            var next = m.NextMatch();
            string parsed = next.Success
                ? Email.Substring(startPosition, next.Index - startPosition).Trim()
                : Email.Substring(startPosition).Trim();

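            // Indexer assignment: a repeated identifier overwrites the earlier
            // value instead of throwing an ArgumentException as Add() would.
            dict[m.Value] = CleanUpValue(parsed);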
        }
        return dict;
    }

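    private string CleanUpValue(string value)
    {
        // A colon inside the value is taken as the end of an unknown identifier.
        int pos = value.IndexOf(':');
        if (pos < 0) return value;
        // Cut at the last space before the colon. If there is none (e.g. "10:30"),
        // keep the value unchanged instead of calling Substring with -1.
        int space = value.LastIndexOf(' ', pos);
        return space < 0 ? value : value.Substring(0, space).TrimEnd();
    }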
}
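
If only Order, The Total, and Type are needed, and each of those values is a single token without spaces (as in the sample emails in the question), a different and simpler technique may be enough: capture the first non-whitespace token after each identifier. This is a sketch under that assumption, not the approach used by EmailParser above; OrderFieldExtractor and Extract are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper used only for illustration.
public static class OrderFieldExtractor
{
    // Matches a known identifier, the colon, optional whitespace,
    // then the first run of non-whitespace characters as the value.
    private static readonly Regex FieldRegex =
        new Regex(@"\b(?<key>Order|The Total|Type):\s*(?<value>\S+)");

    public static Dictionary<string, string> Extract(string email)
    {
        var fields = new Dictionary<string, string>();
        foreach (Match m in FieldRegex.Matches(email))
        {
            // Keys are stored without the colon; a repeated identifier overwrites the earlier value.
            fields[m.Groups["key"].Value] = m.Groups["value"].Value;
        }
        return fields;
    }
}
```

Looping over the stored emails and calling OrderFieldExtractor.Extract(email) on each gives a dictionary keyed by "Order", "The Total", and "Type". The match is case-sensitive, so the lowercase "order:" in "below is your order:" is not treated as an identifier. The single-token assumption breaks if a wanted value is empty or contains spaces; in that case the identifier-slicing approach above is the safer choice.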

Upvotes: 1
