Reputation: 6869
I have an Excel spreadsheet that I am reading into a Pandas DataFrame:
df = pd.read_excel("file.xls")
However, one of the columns of the spreadsheet contains text that has a hyperlink associated with it. How do I access the underlying hyperlink in Pandas?
Upvotes: 19
Views: 27888
Reputation: 108
I faced the same issue when I tried to copy an Excel file using pandas, and came up with the following code to solve it.
import pandas as pd
import openpyxl

def get_columnn_data_list(wb, column_number):
    # Collect the raw cell values of one column, skipping the header row.
    data = []
    ws = wb["Sheet1"]
    for row in ws.iter_rows(min_row=2):
        value = row[column_number].value
        data.append(value)
    return data

df = pd.read_excel("input.xlsx")
wb = openpyxl.load_workbook("input.xlsx")
column_number = 3  # Zero-based index of the column with hyperlinks.
column_name = "column_name"  # Name of that same column in the DataFrame.
df[column_name] = get_columnn_data_list(wb, column_number)
What happens here is that we manually iterate through the Excel sheet and collect the cell values, hyperlinks included, for a given column number using the get_columnn_data_list function. We then overwrite the corresponding DataFrame column with the results. When the links were entered as HYPERLINK formulas, the results have the form
=HYPERLINK("https://www.example.com", "visible_text")
So if you want only the URL or only the visible text, you will have to do some simple string splitting of your own.
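The string splitting mentioned above could also be done with a regular expression. This is only a minimal sketch: the helper name split_hyperlink_formula is made up for illustration, and it assumes the formula has exactly the two-argument form shown, with both arguments written as double-quoted literals that contain no embedded quotes.

```python
import re

# Matches =HYPERLINK("url", "text"). Assumption: both arguments are plain
# double-quoted literals without embedded quotes.
_HYPERLINK_RE = re.compile(
    r'^=HYPERLINK\(\s*"([^"]*)"\s*,\s*"([^"]*)"\s*\)$',
    re.IGNORECASE,
)


def split_hyperlink_formula(value):
    """Return (url, text) for a HYPERLINK formula, or (None, value) otherwise."""
    if isinstance(value, str):
        match = _HYPERLINK_RE.match(value.strip())
        if match:
            return match.group(1), match.group(2)
    # Not a formula of the expected shape: pass the value through unchanged.
    return None, value
```

Applied to the column from the answer, something like df[column_name].map(lambda v: split_hyperlink_formula(v)[0]) would give just the URLs, with None wherever a cell holds no such formula.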
Upvotes: 0
Reputation: 15777
This can be done with openpyxl; I'm not sure it's possible with Pandas at all. Here's how I've done it:
import openpyxl
wb = openpyxl.load_workbook('yourfile.xlsm')
sheets = wb.sheetnames
ws = wb[sheets[0]]
# The older form below triggers a deprecation warning:
# ws = wb.get_sheet_by_name('Sheet1')
print(ws.cell(row=2, column=1).hyperlink.target)
You can also use IPython and set a variable equal to the hyperlink object:
t = ws.cell(row=2, column=1).hyperlink
then type t. and press Tab to see everything you can do with, or access from, the object.
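To collect the targets for a whole column rather than a single cell, something along these lines might work. It is a sketch only: the names hyperlink_targets and read_hyperlink_column and their parameters are hypothetical, it assumes the header is in row 1, and it yields None for cells without a link, because cell.hyperlink is None there.

```python
from openpyxl import load_workbook


def hyperlink_targets(ws, column, first_row=2):
    """Return the hyperlink targets of one worksheet column (1-based index).

    Cells without a hyperlink yield None. first_row=2 assumes that row 1
    holds the header.
    """
    targets = []
    for row in range(first_row, ws.max_row + 1):
        link = ws.cell(row=row, column=column).hyperlink
        targets.append(link.target if link is not None else None)
    return targets


def read_hyperlink_column(path, column, sheet_index=0):
    """Open a workbook (not in read-only mode) and collect one column's links."""
    wb = load_workbook(path)
    ws = wb[wb.sheetnames[sheet_index]]
    return hyperlink_targets(ws, column)
```

The resulting list can then be assigned to a DataFrame column, for example df["url"] = read_hyperlink_column("yourfile.xlsm", column=1), provided the DataFrame was read from the same sheet with its header in row 1 and no rows skipped. The column name "url" is just a placeholder.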
Upvotes: 17
Reputation: 406
Here is a quick monkey patch, without converters or anything like that, for when you want to treat ALL cells that have hyperlinks as hyperlinks. A more sophisticated approach, I suppose, would at least let you choose which columns to treat as hyperlinked, or somehow keep both the data and the hyperlink in the same DataFrame cell. Whether this can be done with converters, I don't know. (BTW, I also played with data_only and keep_links, which did not help; only changing read_only worked, and I suppose that can slow your code down.)
P.S.: This works only with xlsx files, i.e., when the engine is openpyxl.
P.P.S.: If you are reading this in the future and issue https://github.com/pandas-dev/pandas/issues/13439 is still open, don't forget to check for changes to _convert_cell and load_workbook in pandas.io.excel._openpyxl and update the patch accordingly.
import pandas
from pandas.io.excel._openpyxl import OpenpyxlReader
import numpy as np
from pandas._typing import FilePathOrBuffer, Scalar

def _convert_cell(self, cell, convert_float: bool) -> Scalar:
    from openpyxl.cell.cell import TYPE_BOOL, TYPE_ERROR, TYPE_NUMERIC
    # Added: hyperlink support.
    if cell.hyperlink and cell.hyperlink.target:
        return cell.hyperlink.target
        # To return both the value and the hyperlink instead, comment out the
        # return above and uncomment the one below. Note that this makes the
        # result harder to parse if "|||" occurs in the value or the hyperlink.
        # return f'{cell.value}|||{cell.hyperlink.target}'
    # The original pandas code starts here, except that "if" became "elif".
    elif cell.is_date:
        return cell.value
    elif cell.data_type == TYPE_ERROR:
        return np.nan
    elif cell.data_type == TYPE_BOOL:
        return bool(cell.value)
    elif cell.value is None:
        return ""  # compat with xlrd
    elif cell.data_type == TYPE_NUMERIC:
        # GH5394
        if convert_float:
            val = int(cell.value)
            if val == cell.value:
                return val
        else:
            return float(cell.value)
    return cell.value

def load_workbook(self, filepath_or_buffer: FilePathOrBuffer):
    from openpyxl import load_workbook
    # read_only had to be changed to False:
    return load_workbook(
        filepath_or_buffer, read_only=False, data_only=True, keep_links=False
    )

# Replace the reader's methods with the patched versions.
OpenpyxlReader._convert_cell = _convert_cell
OpenpyxlReader.load_workbook = load_workbook
After adding the code above to your Python file, you can call df = pandas.read_excel(input_file) and the hyperlinked cells will come back as their link targets.
After writing all this, it occurred to me that it might be easier and cleaner to just use openpyxl by itself ^_^
Upvotes: 3
Reputation: 19
As slaw commented, it doesn't grab the hyperlink, only the text.
Here test.xlsx contains the links in the 9th column and the text in the column after it:
from openpyxl import load_workbook

workbook = load_workbook('test.xlsx')
worksheet = workbook.active
column_indices = [9]
for row in range(2, worksheet.max_row + 1):
    for col in column_indices:
        filelocation = worksheet.cell(column=col, row=row)  # this is the hyperlink
        text = worksheet.cell(column=col + 1, row=row)  # this is your text
        # Overwrite the text cell with a HYPERLINK formula combining both.
        worksheet.cell(column=col + 1, row=row).value = '=HYPERLINK("' + filelocation.value + '","' + text.value + '")'
workbook.save('test.xlsx')
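The string concatenation above raises a TypeError when a cell value is not a string (a number or an empty cell, for example), and it produces a broken formula if a value contains a double quote. A slightly more defensive way to build the formula might look like this. The helper name hyperlink_formula is hypothetical; doubling embedded quotes follows Excel's rule for string literals in formulas.

```python
def hyperlink_formula(url, text=None):
    """Build an Excel =HYPERLINK(...) formula string.

    Embedded double quotes are doubled, which is how Excel escapes them
    inside string literals. If text is None, the URL is used as the label.
    """
    def quote(value):
        # str() lets numeric cell values through without a TypeError.
        return '"' + str(value).replace('"', '""') + '"'

    if text is None:
        text = url
    return "=HYPERLINK(" + quote(url) + "," + quote(text) + ")"
```

In the loop above, the assignment would then read worksheet.cell(column=col + 1, row=row).value = hyperlink_formula(filelocation.value, text.value). Rows whose link cell is empty should probably be skipped before calling it.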
Upvotes: 1
Reputation: 12610
You cannot do that in pandas. Try one of the other libraries designed to work with Excel files.
Upvotes: 0