Conrad S
Conrad S

Reputation: 354

Properly parsing excel format for dates

I've been struggling with this issue for a while. I've tried with multiple python excel libraries and they all seem to have the same issue. The end desired result is essentially "what you see is what you get" for xlsx files. All of the python libraries that interact with excel return the value that is stored within excel and possibly the corresponding format for that value. I am struggling to then use that format to actually get a value that looks like what you see in excel or another spreadsheet application like libre office calc.

Say we have a sheet with a row that looks something like this: enter image description here

The format (shown using libre office calc) is here: enter image description here

And now here is some code to open the sheet and output the stored value and the format

import openpyxl
book = openpyxl.load_workbook(
    'test.xlsx',
    read_only=True,
    data_only=False,
)
sheet = book.worksheets[0]
for row in sheet.iter_rows():
    for cell in row:
        print('FORMAT:', cell.number_format)
        print('VALUE:', cell.value)
        print('TYPE:', type(cell.value))

Running that code (python 3.6.7, openpyxl 3.0.1) produces the following truncated output:

FORMAT: yyyy\-mm\-dd\Thh:mm\Z
VALUE: 2017-04-19 15:17:00.000004
TYPE: <class 'datetime.datetime'>
...

My question is, how do you parse that format string (yyyy-mm-dd\Thh:mm\Z) into a valid python strftime datetime representation. I started writing a simple function that uses string replace to replace yyyy with %Y, yy with %y and so on. But then I noticed that there are two instances of mm in the format string, one corresponding to months and one corresponding to minutes! How are you expected to parse that? Does month always come first? What happens when there is only minutes? And what if you want a datetime format with the time first, date second?

Any help would be greatly appreciated. Either a python library that already does this, a well documented specification of the xlsx file format that would allow me to build my own parser (I found this but it didn't seem to have what I wanted: https://learn.microsoft.com/en-us/openspecs/office_file_formats/ms-xls/300280fd-e4fe-4675-a924-4d383af48d3b), or an example in another language. It would also be great if this could be generalized beyond dates and always be used for all excel formatting.

Upvotes: 0

Views: 1585

Answers (2)

Conrad S
Conrad S

Reputation: 354

Unfortunately, stovfl's solution was not actually generalized to all xlsx formats. After much searching through Microsoft documentation I was finally able to find this page that documents some of the rules of excel number_format.

Important things to note:

  • mm and m refer to minutes ONLY if the most immediate preceding code is a hh or h (hours) or if the most immediate following code is ss or s (seconds), otherwise mm and m refer to months.
  • Most characters that are not codes must be preceded with a backslash
  • Characters surrounded by quotation marks are interpreted literally (not as codes)
  • There's a big list of characters that display without any escaping or quotation marks
  • There is the concept of sections, separated by a semicolon. For the purpose of this solution I chose to ignore sections (because if the sections are actually used the resulting output will not look like a date).
  • Certain codes in excel have no equivalent code in strftime. For example, mmmmm displays the first letter of the month. For my solution I chose to replace these codes with similar strftime codes (for mmmmm I chose %b, which displays an abbreviation of the month). I noted these codes in comments

Anyways, I just built a function that given an excel number_format date string, returns the python strftime equivalent. I hope this can help someone looking for a way to get "What you see is what you get" from excel to python.

EXCEL_CODES = {
        'yyyy': '%Y',
        'yy': '%y',
        'dddd': '%A',
        'ddd': '%a',
        'dd': '%d',
        'd': '%-d',
        # Different from excel as there is no J-D in strftime
        'mmmmmm': '%b',
        'mmmm': '%B',
        'mmm': '%b',
        'hh': '%H',
        'h': '%-H',
        'ss': '%S',
        's': '%-S',
        # Possibly different from excel as there is no am/pm in strftime
        'am/pm': '%p',
        # Different from excel as there is no A/P or a/p in strftime
        'a/p': '%p',
}

EXCEL_MINUTE_CODES = {
    'mm': '%M',
    'm': '%-M',
}
EXCEL_MONTH_CODES = {
    'mm': '%m',
    'm': '%-m',
}

EXCEL_MISC_CHARS = [
    '$',
    '+',
    '(',
    ':',
    '^',
    '\'',
    '{',
    '<',
    '=',
    '-',
    '/',
    ')',
    '!',
    '&',
    '~',
    '}',
    '>',
    ' ',
]

EXCEL_ESCAPE_CHAR = '\\'
EXCEL_SECTION_DIVIDER = ';'

def convert_excel_date_format_string(excel_date):
    '''
    Created using documentation here:
    https://support.office.com/en-us/article/review-guidelines-for-customizing-a-number-format-c0a1d1fa-d3f4-4018-96b7-9c9354dd99f5

    '''
    # The python date string that is being built
    python_date = ''
    # The excel code currently being parsed
    excel_code = ''
    prev_code = ''
    # If the previous character was the escape character
    char_escaped = False
    # If we are in a quotation block (surrounded by "")
    quotation_block = False
    # Variables used for checking if a code should be a minute or a month
    checking_minute_or_month = False
    minute_or_month_buffer = ''

    for c in excel_date:
        ec = excel_code.lower()
        # The previous character was an escape, the next character should be added normally
        if char_escaped:
            if checking_minute_or_month:
                minute_or_month_buffer += c
            else:
                python_date += c
            char_escaped = False
            continue
        # Inside a quotation block
        if quotation_block:
            if c == '"':
                # Quotation block should now end
                quotation_block = False
            elif checking_minute_or_month:
                minute_or_month_buffer += c
            else:
                python_date += c
            continue
        # The start of a quotation block
        if c == '"':
            quotation_block = True
            continue
        if c == EXCEL_SECTION_DIVIDER:
            # We ignore excel sections for datetimes
            break

        is_escape_char = c == EXCEL_ESCAPE_CHAR
        # The am/pm and a/p code add some complications, need to make sure we are not that code
        is_misc_char = c in EXCEL_MISC_CHARS and (c != '/' or (ec != 'am' and ec != 'a'))
        # Code is finished, check if it is a proper code
        if (is_escape_char or is_misc_char) and ec:
            # Checking if the previous code should have been minute or month
            if checking_minute_or_month:
                if ec == 'ss' or ec == 's':
                    # It should be a minute!
                    minute_or_month_buffer = EXCEL_MINUTE_CODES[prev_code] + minute_or_month_buffer
                else:
                    # It should be a months!
                    minute_or_month_buffer = EXCEL_MONTH_CODES[prev_code] + minute_or_month_buffer
                python_date += minute_or_month_buffer
                checking_minute_or_month = False
                minute_or_month_buffer = ''

            if ec in EXCEL_CODES:
                python_date += EXCEL_CODES[ec]
            # Handle months/minutes differently
            elif ec in EXCEL_MINUTE_CODES:
                # If preceded by hours, we know this is referring to minutes
                if prev_code == 'h' or prev_code == 'hh':
                    python_date += EXCEL_MINUTE_CODES[ec]
                else:
                    # Have to check if the next code is ss or s
                    checking_minute_or_month = True
                    minute_or_month_buffer = ''
            else:
                # Have to abandon this attempt to convert because the code is not recognized
                return None
            prev_code = ec
            excel_code = ''
        if is_escape_char:
            char_escaped = True
        elif is_misc_char:
            # Add the misc char
            if checking_minute_or_month:
                minute_or_month_buffer += c
            else:
                python_date += c
        else:
            # Just add to the code
            excel_code += c

    # Complete, check if there is still a buffer
    if checking_minute_or_month:
        # We know it's a month because there were no more codes after
        minute_or_month_buffer = EXCEL_MONTH_CODES[prev_code] + minute_or_month_buffer
        python_date += minute_or_month_buffer
    if excel_code:
        ec = excel_code.lower()
        if ec in EXCEL_CODES:
            python_date += EXCEL_CODES[ec]
        elif ec in EXCEL_MINUTE_CODES:
            if prev_code == 'h' or prev_code == 'hh':
                python_date += EXCEL_MINUTE_CODES[ec]
            else:
                python_date += EXCEL_MONTH_CODES[ec]
        else:
            return None
    return python_date

Tested with python 3.6.7 using openpyxl 3.0.1

Upvotes: 3

stovfl
stovfl

Reputation: 15523

Question: Parse string ("yyyy-mm-dd\Thh:mm\Z") into a valid datetime.strftime Format Code.



import re
from datetime import datetime

class XLSXdatetime:
    translate = {'yyyy': '%Y', 'mm': '%m', 'dd': '%d', 
                 'hh:mm': '%H:%M', 'hh:mm:ss': '%H:%M:%S'}
    rec = re.compile(r'([\w:]+|\\.)')

    def __init__(self, xlsx_format):
        self.xlsx_format = xlsx_format

    @property
    def format(self):
        _format = []
        for item in XLSXdatetime.rec.findall(self.xlsx_format):
            if item.startswith('\\'):
                item = item[1:]
            _format.append(XLSXdatetime.translate.get(item, item))

        return ''.join(_format)

    def strftime(self, data):
        return data.strftime(self.format)

Usage:

  • data = datetime.strptime('2017-04-19 15:17:00.000004', '%Y-%m-%d %H:%M:%S.%f')
    print('data: {}'.format(data))
    
    # Long version
    for _format in ['yyyy-mm-dd hh:mm:ss', 
                    'yyyy\-mm\-dd\Thh:mm\Z'
                   ]:
        xlsx_datetime = XLSXdatetime(_format)    
        print("{} => {} = '{}'".format(_format, 
                                       xlsx_datetime.format, 
                                       xlsx_datetime.strftime(data)))
    

    Output:

    data: 2017-04-19 15:17:00.000004
    yyyy-mm-dd hh:mm:ss => %Y%m%d%H:%M:%S = '2017041915:17:00'
    yyyy\-mm\-dd\Thh:mm\Z => %Y-%m-%dT%H:%MZ = '2017-04-19T15:17Z'
    

  • # Short version
    for _format in ['yyyy-mm-dd hh:mm:ss', 
                    'yyyy\-mm\-dd\Thh:mm\Z'
                   ]:
        print("'{}'".format(XLSXdatetime(_format).strftime(data)))
    

    Output:

    data: 2017-04-19 15:17:00.000004
    '2017041915:17:00'
    '2017-04-19T15:17Z'
    

Tested with Python: 3.6

Upvotes: 0

Related Questions