Reputation: 354
I've been struggling with this issue for a while. I've tried with multiple python excel libraries and they all seem to have the same issue. The end desired result is essentially "what you see is what you get" for xlsx files. All of the python libraries that interact with excel return the value that is stored within excel and possibly the corresponding format for that value. I am struggling to then use that format to actually get a value that looks like what you see in excel or another spreadsheet application like libre office calc.
Say we have a sheet with a row that looks something like this:
The format (shown using libre office calc) is here:
And now here is some code to open the sheet and output the stored value and the format
import openpyxl
book = openpyxl.load_workbook(
'test.xlsx',
read_only=True,
data_only=False,
)
sheet = book.worksheets[0]
for row in sheet.iter_rows():
for cell in row:
print('FORMAT:', cell.number_format)
print('VALUE:', cell.value)
print('TYPE:', type(cell.value))
Running that code (python 3.6.7, openpyxl 3.0.1) produces the following truncated output:
FORMAT: yyyy\-mm\-dd\Thh:mm\Z
VALUE: 2017-04-19 15:17:00.000004
TYPE: <class 'datetime.datetime'>
...
My question is, how do you parse that format string (yyyy-mm-dd\Thh:mm\Z) into a valid python strftime datetime representation. I started writing a simple function that uses string replace to replace yyyy
with %Y
, yy
with %y
and so on. But then I noticed that there are two instances of mm
in the format string, one corresponding to months and one corresponding to minutes! How are you expected to parse that? Does month always come first? What happens when there is only minutes? And what if you want a datetime format with the time first, date second?
Any help would be greatly appreciated. Either a python library that already does this, a well documented specification of the xlsx file format that would allow me to build my own parser (I found this but it didn't seem to have what I wanted: https://learn.microsoft.com/en-us/openspecs/office_file_formats/ms-xls/300280fd-e4fe-4675-a924-4d383af48d3b), or an example in another language. It would also be great if this could be generalized beyond dates and always be used for all excel formatting.
Upvotes: 0
Views: 1585
Reputation: 354
Unfortunately, stovfl's solution was not actually generalized to all xlsx formats. After much searching through Microsoft documentation I was finally able to find this page that documents some of the rules of excel number_format.
Important things to note:
sections
, separated by a semicolon. For the purpose of this solution I chose to ignore sections (because if the sections are actually used the resulting output will not look like a date).mmmmm
displays the first letter of the month. For my solution I chose to replace these codes with similar strftime codes (for mmmmm
I chose %b
, which displays an abbreviation of the month). I noted these codes in commentsAnyways, I just built a function that given an excel number_format date string, returns the python strftime equivalent. I hope this can help someone looking for a way to get "What you see is what you get" from excel to python.
EXCEL_CODES = {
'yyyy': '%Y',
'yy': '%y',
'dddd': '%A',
'ddd': '%a',
'dd': '%d',
'd': '%-d',
# Different from excel as there is no J-D in strftime
'mmmmmm': '%b',
'mmmm': '%B',
'mmm': '%b',
'hh': '%H',
'h': '%-H',
'ss': '%S',
's': '%-S',
# Possibly different from excel as there is no am/pm in strftime
'am/pm': '%p',
# Different from excel as there is no A/P or a/p in strftime
'a/p': '%p',
}
EXCEL_MINUTE_CODES = {
'mm': '%M',
'm': '%-M',
}
EXCEL_MONTH_CODES = {
'mm': '%m',
'm': '%-m',
}
EXCEL_MISC_CHARS = [
'$',
'+',
'(',
':',
'^',
'\'',
'{',
'<',
'=',
'-',
'/',
')',
'!',
'&',
'~',
'}',
'>',
' ',
]
EXCEL_ESCAPE_CHAR = '\\'
EXCEL_SECTION_DIVIDER = ';'
def convert_excel_date_format_string(excel_date):
'''
Created using documentation here:
https://support.office.com/en-us/article/review-guidelines-for-customizing-a-number-format-c0a1d1fa-d3f4-4018-96b7-9c9354dd99f5
'''
# The python date string that is being built
python_date = ''
# The excel code currently being parsed
excel_code = ''
prev_code = ''
# If the previous character was the escape character
char_escaped = False
# If we are in a quotation block (surrounded by "")
quotation_block = False
# Variables used for checking if a code should be a minute or a month
checking_minute_or_month = False
minute_or_month_buffer = ''
for c in excel_date:
ec = excel_code.lower()
# The previous character was an escape, the next character should be added normally
if char_escaped:
if checking_minute_or_month:
minute_or_month_buffer += c
else:
python_date += c
char_escaped = False
continue
# Inside a quotation block
if quotation_block:
if c == '"':
# Quotation block should now end
quotation_block = False
elif checking_minute_or_month:
minute_or_month_buffer += c
else:
python_date += c
continue
# The start of a quotation block
if c == '"':
quotation_block = True
continue
if c == EXCEL_SECTION_DIVIDER:
# We ignore excel sections for datetimes
break
is_escape_char = c == EXCEL_ESCAPE_CHAR
# The am/pm and a/p code add some complications, need to make sure we are not that code
is_misc_char = c in EXCEL_MISC_CHARS and (c != '/' or (ec != 'am' and ec != 'a'))
# Code is finished, check if it is a proper code
if (is_escape_char or is_misc_char) and ec:
# Checking if the previous code should have been minute or month
if checking_minute_or_month:
if ec == 'ss' or ec == 's':
# It should be a minute!
minute_or_month_buffer = EXCEL_MINUTE_CODES[prev_code] + minute_or_month_buffer
else:
# It should be a months!
minute_or_month_buffer = EXCEL_MONTH_CODES[prev_code] + minute_or_month_buffer
python_date += minute_or_month_buffer
checking_minute_or_month = False
minute_or_month_buffer = ''
if ec in EXCEL_CODES:
python_date += EXCEL_CODES[ec]
# Handle months/minutes differently
elif ec in EXCEL_MINUTE_CODES:
# If preceded by hours, we know this is referring to minutes
if prev_code == 'h' or prev_code == 'hh':
python_date += EXCEL_MINUTE_CODES[ec]
else:
# Have to check if the next code is ss or s
checking_minute_or_month = True
minute_or_month_buffer = ''
else:
# Have to abandon this attempt to convert because the code is not recognized
return None
prev_code = ec
excel_code = ''
if is_escape_char:
char_escaped = True
elif is_misc_char:
# Add the misc char
if checking_minute_or_month:
minute_or_month_buffer += c
else:
python_date += c
else:
# Just add to the code
excel_code += c
# Complete, check if there is still a buffer
if checking_minute_or_month:
# We know it's a month because there were no more codes after
minute_or_month_buffer = EXCEL_MONTH_CODES[prev_code] + minute_or_month_buffer
python_date += minute_or_month_buffer
if excel_code:
ec = excel_code.lower()
if ec in EXCEL_CODES:
python_date += EXCEL_CODES[ec]
elif ec in EXCEL_MINUTE_CODES:
if prev_code == 'h' or prev_code == 'hh':
python_date += EXCEL_MINUTE_CODES[ec]
else:
python_date += EXCEL_MONTH_CODES[ec]
else:
return None
return python_date
Tested with python 3.6.7 using openpyxl 3.0.1
Upvotes: 3
Reputation: 15523
Question: Parse string (
"yyyy-mm-dd\Thh:mm\Z"
) into a validdatetime.strftime
Format Code.
import re
from datetime import datetime
class XLSXdatetime:
translate = {'yyyy': '%Y', 'mm': '%m', 'dd': '%d',
'hh:mm': '%H:%M', 'hh:mm:ss': '%H:%M:%S'}
rec = re.compile(r'([\w:]+|\\.)')
def __init__(self, xlsx_format):
self.xlsx_format = xlsx_format
@property
def format(self):
_format = []
for item in XLSXdatetime.rec.findall(self.xlsx_format):
if item.startswith('\\'):
item = item[1:]
_format.append(XLSXdatetime.translate.get(item, item))
return ''.join(_format)
def strftime(self, data):
return data.strftime(self.format)
Usage:
data = datetime.strptime('2017-04-19 15:17:00.000004', '%Y-%m-%d %H:%M:%S.%f')
print('data: {}'.format(data))
# Long version
for _format in ['yyyy-mm-dd hh:mm:ss',
'yyyy\-mm\-dd\Thh:mm\Z'
]:
xlsx_datetime = XLSXdatetime(_format)
print("{} => {} = '{}'".format(_format,
xlsx_datetime.format,
xlsx_datetime.strftime(data)))
Output:
data: 2017-04-19 15:17:00.000004 yyyy-mm-dd hh:mm:ss => %Y%m%d%H:%M:%S = '2017041915:17:00' yyyy\-mm\-dd\Thh:mm\Z => %Y-%m-%dT%H:%MZ = '2017-04-19T15:17Z'
# Short version
for _format in ['yyyy-mm-dd hh:mm:ss',
'yyyy\-mm\-dd\Thh:mm\Z'
]:
print("'{}'".format(XLSXdatetime(_format).strftime(data)))
Output:
data: 2017-04-19 15:17:00.000004 '2017041915:17:00' '2017-04-19T15:17Z'
Tested with Python: 3.6
Upvotes: 0