Reputation: 789

How to merge multiple .xls files with hyperlinks in python?

I am trying to merge multiple .xls files that have many columns, but 1 column with hyperlinks. I try to do this with Python but keep running into unsolvable errors.

Just to be concise, the hyperlinks are hidden under a text section. The following ctrl-click hyperlink is an example of what I encounter in the .xls files: ES2866911 (T3).

In order to improve reproducibility, I have added .xls1 and .xls2 samples below.

xls1:

Title	Publication_Number
P_A	ES2866911 (T3)
P_B	EP3887362 (A1)

.xls2:

Title	Publication_Number
P_C	AR118706 (A2)
P_D	ES2867600 (T3)

Desired outcome:

Title	Publication_Number
P_A	ES2866911 (T3)
P_B	EP3887362 (A1)
P_C	AR118706 (A2)
P_D	ES2867600 (T3)

I am unable to get .xls file into Python without losing formatting or losing hyperlinks. In addition I am unable to convert .xls files to .xlsx. I have no possibility to acquire the .xls files in .xlsx format. Below I briefly summarize what I have tried:

1.) Reading with pandas was my first attempt. Easy to do, but all hyperlinks are lost in PD, furthermore all formatting from original file is lost.

2.) Reading .xls files with openpyxl.load

InvalidFileException: openpyxl does not support the old .xls file format, please use xlrd to read this file, or convert it to the more recent .xlsx file format.

3.) Converting .xls files to .xlsx

from xls2xlsx import XLS2XLSX
x2x = XLS2XLSX(input.file.xls)
wb = x2x.to_xlsx()
x2x.to_xlsx('output_file.xlsx')
TypeError: got invalid input value of type <class 'xml.etree.ElementTree.Element'>, expected string or Element

import pyexcel as p
p.save_book_as(file_name=input_file.xls, dest_file_name=export_file.xlsx)
TypeError: got invalid input value of type <class 'xml.etree.ElementTree.Element'>, expected string or Element
During handling of the above exception, another exception occurred:
StopIteration

4.) Even if we are able to read the .xls file with xlrd for example (meaning we will never be able to save the file as .xlsx, I can't even see the hyperlink:

import xlrd
wb = xlrd.open_workbook(file) # where vis.xls is your test file
ws = wb.sheet_by_name('Sheet1')
ws.cell(5, 1).value   
'AR118706 (A2)' #Which is the name, not hyperlink

5.) I tried installing older versions of openpyxl==3.0.1 to overcome type error to no succes. I tried to open .xls file with openpyxl with xlrd engine, similar typerror "xml.entree.elementtree.element' error occured. I tried many ways to batch convert .xls files to .xlsx all with similar errors.

Obviously I can just open with excel and save as .xlsx but this defeats the entire purpose, and I can't do that for 100's of files.

Upvotes: 2

Answers (4)

Rivered

Reputation: 789

Inspired by @Kunal, I managed to write code that avoids using Pandas libraries. .xls files are read by xlrd, and written to a new excel file by xlwt. Hyperlinks are maintened, and output file was saved as .xlsx format:

import os
import xlwt
from xlrd import open_workbook

# read and combine data
directory = "random_directory"
required_files = os.listdir(directory)

#Define new file and sheet to get files into
new_file = xlwt.Workbook(encoding='utf-8', style_compression = 0)
new_sheet = new_file.add_sheet('Sheet1', cell_overwrite_ok = True)

#Initialize header row, can be done with any file 
old_file = open_workbook(directory+"/"+required_files[0], formatting_info=True)
old_sheet = old_file.sheet_by_index(0)
for column in list(range(0, old_sheet.ncols)):
    new_sheet.write(0, column, old_sheet.cell(0, column).value) #To create header row

#Add rows from all files present in folder 
for file in required_files:
    old_file = open_workbook(directory+"/"+file, formatting_info=True) 
    old_sheet = old_file.sheet_by_index(0) #Define old sheet
    hyperlink_map = old_sheet.hyperlink_map #Create map of all hyperlinks
    for row in range(1, old_sheet.nrows): #We need all rows except header row
        if row-1 < len(hyperlink_map.items()): #Statement to ensure we do not go out of range on the lower side of hyperlink_map.items()
            Row_depth=len(new_sheet._Worksheet__rows) #We need row depth to know where to add new row           
            for col in list(range(old_sheet.ncols)): #For every column we need to add row cell
                if col is 1: #We need to make an exception for column 2 being the hyperlinked column
                    click=list(hyperlink_map.items())[row-1][1].url_or_path #define URL
                    new_sheet.write(Row_depth, col, xlwt.Formula('HYPERLINK("{}", "{}")'.format(click, old_sheet.cell(row, 1).value)))
                else: #If not hyperlinked column
                    new_sheet.write(Row_depth, col, old_sheet.cell(row, col).value) #Write cell

new_file.save("random_directory/output_file.xlsx")

Upvotes: 1

Johan Martinson

Reputation: 46

I assume the same as daedalus in terms of the excel files. Instead of pandas I use openpyxl to read and create a new excel file.

import openpyxl

wb1 = openpyxl.load_workbook('tmp.xlsx')
ws1 = wb.get_sheet_by_name('Sheet1')

wb2 = openpyxl.load_workbook('tmp2.xlsx')
ws2 = wb.get_sheet_by_name('Sheet1')

csvDict = {}

# Go through first sheet to find the hyperlinks and keys.
for (row in ws1.max_row):
    hyperlink_dict[ws1.cell(row=row, column=1).value] =
       [ws1.cell(row=row, column=2).hyperlink.target,
        ws1.cell(row=row, column=2).value]
 
# Go Through second sheet to find hyperlinks and keys.
for (row in ws2.max_row):
    hyperlink_dict[ws2.cell(row=row, column=1).value] =
       [ws2.cell(row=row, column=2).hyperlink.target,
        ws2.cell(row=row, column=2).value]

Now you have all the data so you can create a new workbook and save the values from the dict into it via opnenpyxl.

wb = Workbook(write_only=true)
ws = wb.create_sheet()

for irow in len(csvDict):
    #use ws.append() to add the data from the csv.

wb.save('new_big_file.xlsx')

https://openpyxl.readthedocs.io/en/stable/optimized.html#write-only-mode

Upvotes: 0

Kunal Shah

Reputation: 107

You need to use xlrd library to read the hyperlinks properly, pandas to merge all data together and xlsxwriter to write the data properly. Assuming all input files have same format, you can use below code.

# imports
import os
import xlrd
import xlsxwriter
import pandas as pd

# required functions
def load_excel_to_df(filepath, hyperlink_col):
    book = xlrd.open_workbook(file_path)
    sheet = book.sheet_by_index(0)
    hyperlink_map = sheet.hyperlink_map
    
    data = pd.read_excel(filepath)
    hyperlink_col_index = list(data.columns).index(hyperlink_col)
    
    required_links = [v.url_or_path for k, v in hyperlink_map.items() if k[1] == hyperlink_col_index]
    data['hyperlinks'] = required_links
    return data

# main code
# set required variables
input_data_dir = 'path/to/input/data/'
hyperlink_col = 'Publication_Number'
output_data_dir = 'path/to/output/data/'
output_filename = 'combined_data.xlsx'

# read and combine data
required_files = os.listdir(input_data_dir)
combined_data = pd.DataFrame()
for file in required_files:
    curr_data = load_excel_to_df(data_dir + os.sep + file, hyperlink_col)
    combined_data = combined_data.append(curr_data, sort=False, ignore_index=True)
cols = list(combined_data.columns)
m, n = combined_data.shape
hyperlink_col_index = cols.index(hyperlink_col)

# writing data
writer = pd.ExcelWriter(output_data_dir + os.sep + output_filename, engine='xlsxwriter')
combined_data[cols[:-1]].to_excel(writer, index=False, startrow=1, header=False) # last column contains hyperlinks
workbook  = writer.book
worksheet = writer.sheets[list(workbook.sheetnames.keys())[0]]
for i, col in enumerate(cols[:-1]):
    worksheet.write(0, i, col)
for i in range(m):
    worksheet.write_url(i+1, hyperlink_col_index, combined_data.loc[i, cols[-1]], string=combined_data.loc[i, hyperlink_col])
writer.save()

References:

reading hyperlinks - https://stackoverflow.com/a/7057076/17256762
pandas to_excel header formatting - Remove default formatting in header when converting pandas DataFrame to excel sheet
writing hyperlinks with xlsxwriter - https://xlsxwriter.readthedocs.io/example_hyperlink.html

Upvotes: 5

daedalus

Reputation: 10923

Without a clear reproducible example, the problem is not clear. Assume I have two files called tmp.xls and tmp2.xls containing dummy data as in the two screenshots below.

Then pandas can easily, load, concatenate, and convert to .xlsx format without loss of hyperlinks. Here is some demo code and the resulting file:

import pandas as pd

f1 = pd.read_excel('tmp.xls')
f2 = pd.read_excel('tmp2.xls')

f3 = pd.concat([f1, f2], ignore_index=True)

f3.to_excel('./f3.xlsx')

Upvotes: 2

How to merge multiple .xls files with hyperlinks in python?

Answers (4)

Related Questions