user236520

Recommended way of parsing dates when presented in a variety of formats

I have a collection of dates as strings, entered by users over a period of time. Because they came from humans with little or no validation, the formats vary widely. Below are some examples (the leading numbers are for reference only):

  1. 20th, 21st August 1897
  2. 31st May, 1st June 1909
  3. 29th January 2007
  4. 10th, 11th, 12th May 1954
  5. 26th, 27th, 28th, 29th, 30th March 2006
  6. 27th, 28th, 29th, 30th November, 1st December 2006

I would like to parse these dates in C# to end up with sets of DateTime objects, one DateTime object per day. So (1) above would result in 2 DateTime objects and (6) would result in 5 DateTime objects.

Upvotes: 3

Views: 339

Answers (2)

user236520

I thought some more about this and the solution became obvious: tokenize the string and parse the tokens in reverse order. This retrieves the year first, then the month, then the day(s). Here is my solution:

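using System;
using System.Collections;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// **** Start definition of the class MyGlobals ****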
public static class MyGlobals
{
    static Dictionary<string, int> _month2Int = new Dictionary<string, int>
    {
        {"January", 1},
        {"February", 2},
        {"March", 3},
        {"April", 4},
        {"May", 5},
        {"June", 6},
        {"July", 7},
        {"August", 8},
        {"September", 9},
        {"October", 10},
        {"November", 11},
        {"December", 12}
    };
    static public int GetMonthAsInt(string month)
    {
        return( _month2Int[month] );
    }
}


public class MyClass
{
    static char[] gDateSeparators = new char[2] { ',', ' ' };

    // Anchored with ^ and $ so a token must match as a whole, not merely contain a match
    static Regex gDayRegex = new Regex("^[0-9][0-9]?(st|nd|rd|th)$");
    static Regex gMonthRegex = new Regex("^(January|February|March|April|May|June|July|August|September|October|November|December)$");
    static Regex gYearRegex = new Regex("^[0-9]{4}$");

    public void ParseMatchDate(string matchDate)
    {
        Stack<DateTime> matchDateTimes = new Stack<DateTime>();
        string[] tokens = matchDate.Split(gDateSeparators,StringSplitOptions.RemoveEmptyEntries);
        int curYear = int.MinValue;
        int curMonth = int.MinValue;
        int curDay = int.MinValue;

        for (int pos = tokens.Length-1; pos >= 0; --pos)
        {
            if (gYearRegex.IsMatch(tokens[pos]))
            {
                curYear = int.Parse(tokens[pos]);
            }
            else if (gMonthRegex.IsMatch(tokens[pos]))
            {
                curMonth = MyGlobals.GetMonthAsInt(tokens[pos]);
            }
            else if (gDayRegex.IsMatch(tokens[pos]))
            {
                string tok = tokens[pos];
                curDay = int.Parse(tok.Substring(0,(tok.Length-2)));
                // Dates are in reverse order, so using a stack means we'll pull em off in the correct order
                matchDateTimes.Push(new DateTime(curYear, curMonth, curDay));
            }
        }

        // Now get the datetimes; Pop() removes each one, so the loop terminates
        while (matchDateTimes.Count > 0)
        {
            DateTime date = (DateTime)matchDateTimes.Pop();
            // Do something with date here
        }
    }

}
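
If the caller needs the dates back rather than consumed inside the method, the same reverse-token idea can be sketched as a function that returns a list. This is only a minimal sketch under stated assumptions: `ReverseTokenDateParser` and `Parse` are hypothetical names, not part of the code above, and it assumes English month names, four-digit years and ordinal day tokens such as `21st`.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text.RegularExpressions;

// Hypothetical helper: walks the tokens right to left, so the year and month
// are already known when the day tokens that belong to them are reached.
public static class ReverseTokenDateParser
{
    static readonly char[] Separators = { ',', ' ' };
    static readonly Regex DayToken = new Regex("^([0-9]{1,2})(st|nd|rd|th)$");
    static readonly Regex YearToken = new Regex("^[0-9]{4}$");
    // Invariant-culture month names: "January" .. "December" plus one empty slot.
    static readonly string[] MonthNames = CultureInfo.InvariantCulture.DateTimeFormat.MonthNames;

    public static List<DateTime> Parse(string text)
    {
        var dates = new List<DateTime>();
        int? year = null;
        int? month = null;
        string[] tokens = text.Split(Separators, StringSplitOptions.RemoveEmptyEntries);

        for (int pos = tokens.Length - 1; pos >= 0; --pos)
        {
            string tok = tokens[pos];
            int monthIndex = Array.FindIndex(
                MonthNames, n => n.Equals(tok, StringComparison.OrdinalIgnoreCase));
            Match day = DayToken.Match(tok);

            if (YearToken.IsMatch(tok))
            {
                year = int.Parse(tok);
            }
            else if (monthIndex >= 0)
            {
                month = monthIndex + 1;
            }
            else if (day.Success)
            {
                // A day is only meaningful once a month and a year have been seen.
                if (year == null || month == null)
                    throw new FormatException("Day '" + tok + "' is not followed by a month and year.");
                dates.Add(new DateTime(year.Value, month.Value, int.Parse(day.Groups[1].Value)));
            }
        }

        dates.Reverse(); // tokens were read right to left, so restore input order
        return dates;
    }
}
```

Calling `ReverseTokenDateParser.Parse("27th, 28th, 29th, 30th November, 1st December 2006")` returns five dates, 27 November through 1 December 2006. An impossible combination such as `31st November` still makes the `DateTime` constructor throw, so user-entered data may need a try/catch around the call.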

Upvotes: 0

Brad Christie

Reputation: 101614

I would recommend processing them for generalization (basically remove the numbers and names and make them place holders) then group by similar format so you have a sample group to work with.

For example, 20th, 21st August 1897 then becomes [number][postfix], [number][postfix] [month] [year] (given that a number followed by st, nd, rd or th is recognized as [number][postfix], months are obvious, and years are 4-digit numbers).

From there, you can find out how many inputs follow each pattern, and how many unique patterns you need to match. Then you at least have a sample set to test any algorithm against (regex is probably going to be your best bet, since it can detect repeated patterns (#th[, $th[, ...]]) and day names).
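
The generalization step described above could be sketched in C# roughly as follows. Treat it as an illustration rather than a finished tool: `DatePatternSurvey`, `ToSignature` and `CountBySignature` are hypothetical names, and the three regular expressions assume English month names, four-digit years and ordinal days.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper: reduces each input to a placeholder "signature" and
// counts how many inputs share each signature.
public static class DatePatternSurvey
{
    static readonly Regex Day = new Regex(@"\b[0-9]{1,2}(st|nd|rd|th)\b", RegexOptions.IgnoreCase);
    static readonly Regex Month = new Regex(
        @"\b(January|February|March|April|May|June|July|August|September|October|November|December)\b",
        RegexOptions.IgnoreCase);
    static readonly Regex Year = new Regex(@"\b[0-9]{4}\b");

    // Replace concrete days, months and years with placeholders.
    public static string ToSignature(string input)
    {
        string s = Day.Replace(input, "[number][postfix]");
        s = Month.Replace(s, "[month]");
        s = Year.Replace(s, "[year]");
        return s;
    }

    // Map each distinct signature to the number of inputs that produce it.
    public static Dictionary<string, int> CountBySignature(IEnumerable<string> inputs)
    {
        return inputs
            .GroupBy(s => ToSignature(s))
            .ToDictionary(g => g.Key, g => g.Count());
    }
}
```

With this sketch, `ToSignature("20th, 21st August 1897")` yields `[number][postfix], [number][postfix] [month] [year]`. Any signature that still contains digits or unexpected words after the replacements points at a format the three patterns do not cover yet.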


It appears you probably want to break it down by pattern (given what you've provided). So, for instance first break out yearly information:

(.*?)([0-9]{4})(?:, |$)

Then you need to break it down into months:

(.*?)(January|February|...)(?:, |$)

Then you want days contained within that month:

(?:([0-9]{1,2})(?:st|nd|rd|th)(?:, )?)*(?:, |$)

Then it's about compiling the information. But again, that's just going by what you've put in front of me. Ultimately you need to know what kind of data you're working with and how you want to tackle it.
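
One way the year, then month, then day breakdown might look in C# is sketched below. The patterns are simplified variants of the three above, not the exact expressions, and `StagedDateParser` is a hypothetical name. The sketch assumes every month in the input is eventually followed by a four-digit year, and it lets a month without its own year take the next year to its right.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper: peels off year chunks, then month chunks inside each
// year chunk, then the ordinal days inside each month chunk.
public static class StagedDateParser
{
    static readonly string[] MonthNames =
    {
        "January", "February", "March", "April", "May", "June",
        "July", "August", "September", "October", "November", "December"
    };

    // Stage 1: the text leading up to a four-digit year, plus the year itself.
    static readonly Regex YearChunk = new Regex(@"(?<body>.*?)\b(?<year>[0-9]{4})\b");

    // Stage 2: the text leading up to a month name, plus the month itself.
    static readonly Regex MonthChunk = new Regex(
        @"(?<days>.*?)\b(?<month>" + string.Join("|", MonthNames) + @")\b",
        RegexOptions.IgnoreCase);

    // Stage 3: each ordinal day number such as 1st, 22nd or 30th.
    static readonly Regex DayToken = new Regex(
        @"\b(?<day>[0-9]{1,2})(?:st|nd|rd|th)\b", RegexOptions.IgnoreCase);

    public static List<DateTime> Parse(string text)
    {
        var dates = new List<DateTime>();
        foreach (Match y in YearChunk.Matches(text))
        {
            int year = int.Parse(y.Groups["year"].Value);
            foreach (Match m in MonthChunk.Matches(y.Groups["body"].Value))
            {
                string monthName = m.Groups["month"].Value;
                int month = 1 + Array.FindIndex(
                    MonthNames, n => n.Equals(monthName, StringComparison.OrdinalIgnoreCase));
                foreach (Match d in DayToken.Matches(m.Groups["days"].Value))
                {
                    dates.Add(new DateTime(year, month, int.Parse(d.Groups["day"].Value)));
                }
            }
        }
        return dates;
    }
}
```

For example, `StagedDateParser.Parse("31st May, 1st June 1909")` returns 31 May 1909 and 1 June 1909, because both months sit in the chunk that ends with 1909. Days that are not followed by a month, or months that are not followed by a year, are silently skipped in this sketch, so unmatched input is worth logging separately.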


Updated

So, I couldn't help but try to tackle this on my own. I wanted to prove that the method I was suggesting was somewhat accurate and that I wasn't blowing smoke up your skirt. Having said that, this is what I have come up with. Note that this is in PHP for a couple of reasons:

  1. PHP was easier for me to get my hands on
  2. I felt that if this was a viable solution, you should have to work at porting it over. :grin:

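Anyways, here's the source and demo output. Note that samples 2 and 6 come back with only the last month's dates: the monthly pattern is anchored with $, so an earlier month that shares the same year is skipped. Enjoy.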

<?php
  $samples = array(
    '20th, 21st August 1897',
    '31st May, 1st June 1909',
    '29th January 2007',
    '10th, 11th, 12th May 1954',
    '26th, 27th, 28th, 29th, 30th March 2006',
    '27th, 28th, 29th, 30th November, 1st December 2006',
    '30th, 31st, December 2010, 1st, 2nd January 2011'
  );

  //header('Content-Type: text/plain');

  $months = array('january','february','march','april','may','june','july','august','september','october','november','december');

  foreach ($samples as $sample)
  {
    $dates = array();

    // find yearly information first
    $yearly = null;
    if (preg_match_all('/(?:^|\s)(?<month>.*?)\s?(?<year>[0-9]{4})(?:$|,)/',$sample,$yearly))
    {//var_dump($yearly);
      for ($y = 0; $y < count($yearly[0]); $y++)
      {
        $year = $yearly['year'][$y];
        //echo "year: {$year}\r\n";

        $monthly = null;
        if (preg_match_all('/(?<days>(?:(?:^|\s)[0-9]{1,2}(?:st|nd|rd|th),?)*)\s?(?<month>'.implode('|',$months).')$/i',$yearly['month'][$y],$monthly))
        {//var_dump($monthly);
          for ($m = 0; $m < count($monthly[0]); $m++)
          {
            $month = $monthly['month'][$m];
            //echo "month: {$month}\r\n";

            $daily = null;
            if (preg_match_all('/(?:^|\s)(?<day>[0-9]{1,2})(?:st|nd|rd|th)(?:,|$)/i',$monthly['days'][$m],$daily))
            {//var_dump($daily);
              for ($d = 0; $d < count($daily[0]); $d++)
              {
                $day = $daily['day'][$d];
                //echo "day: {$day}\r\n";

                $dates[] = sprintf("%d-%d-%d", array_search(strtolower($month),$months)+1, $day, $year);
              }
            }
          }
        }
      }
    }

    echo "<p><b>{$sample}</b> was parsed to include:</p><ul>\r\n";
    foreach ($dates as $date)
      echo "<li>{$date}</li>\r\n";
    echo "</ul>\r\n";
  }
?>

20th, 21st August 1897 was parsed to include:

  • 8-20-1897
  • 8-21-1897

31st May, 1st June 1909 was parsed to include:

  • 6-1-1909

29th January 2007 was parsed to include:

  • 1-29-2007

10th, 11th, 12th May 1954 was parsed to include:

  • 5-10-1954
  • 5-11-1954
  • 5-12-1954

26th, 27th, 28th, 29th, 30th March 2006 was parsed to include:

  • 3-26-2006
  • 3-27-2006
  • 3-28-2006
  • 3-29-2006
  • 3-30-2006

27th, 28th, 29th, 30th November, 1st December 2006 was parsed to include:

  • 12-1-2006

30th, 31st, December 2010, 1st, 2nd January 2011 was parsed to include:

  • 12-30-2010
  • 12-31-2010
  • 1-1-2011
  • 1-2-2011

And to prove there's nothing up my sleeve, http://www.ideone.com/GGMaH

Upvotes: 3
