Dwight schrute
Dwight schrute

Reputation: 11

How to load xls file from an Amazon S3 bucket and convert to xlsx and save to Amazon S3

I am trying to load an xls file and convert to xlsx from an Amazon S3 bucket.

really appreciate if someone knows how to do:

  1. read xls file from s3
  2. convert xls to xlsx and save in s3.

Constrains: 1.This is an enterprise projet thus cannot be downloading the file to my local and convert and reupload 2. our internal application is running on EC2 linux, thus cannot install packages that works for windows like win32.client to do the conversion.

thank you!!

Upvotes: 0

Views: 1300

Answers (1)

droebi
droebi

Reputation: 946

Try following: When you run it inside AWS: Python: How to read and load an excel file from AWS S3?

When your run it local:

  1. Download this file:
  1. Convert xls to xlsx:

     def xls_to_xlsx(*args, **kw):
         book_xls = xlrd.open_workbook(*args, formatting_info=True, ragged_rows=True, **kw)
         book_xlsx = openpyxl.workbook.Workbook()
    
         sheet_names = book_xls.sheet_names()
         for sheet_index in range(len(sheet_names)):
             sheet_xls = book_xls.sheet_by_name(sheet_names[sheet_index])
             if sheet_index == 0:
                 sheet_xlsx = book_xlsx.active
                 sheet_xlsx.title = sheet_names[sheet_index]
             else:
                 sheet_xlsx = book_xlsx.create_sheet(title=sheet_names[sheet_index])
             for crange in sheet_xls.merged_cells:
                 rlo, rhi, clo, chi = crange
                 sheet_xlsx.merge_cells(start_row=rlo + 1, end_row=rhi,
                                        start_column=clo + 1, end_column=chi, )
    
             def _get_xlrd_cell_value(cell):
                 value = cell.value
                 if cell.ctype == xlrd.XL_CELL_DATE:
                     if value == 1.0:
                         datetime_tup = (0, 0, 0)
                     else:
                         datetime_tup = xlrd.xldate_as_tuple(value, 0)
    
                     if datetime_tup[0:3] == (0, 0, 0):
                         value = datetime.time(*datetime_tup[3:])
                     else:
                         value = datetime.datetime(*datetime_tup)
                 return value
    
             for row in range(sheet_xls.nrows):
                 sheet_xlsx.append((
                     _get_xlrd_cell_value(cell)
                     for cell in sheet_xls.row_slice(row, end_colx=sheet_xls.row_len(row))
                 ))
    
         return book_xlsx
    
  2. Upload it back to S3:

Upvotes: 1

Related Questions