Reputation: 1
I have a .ods file, which looks pretty similar to a usual excel file. This file contains a column with hyperlinks in it. To parse the full document, I've been using the pandas read_excel() method and works perfectly well for the raw data, but the hyperlinks are lost.
Do anyone know of a solution to parse hyperlinks from an .ods file?
Looking on the web, users propose to use openpyxl to load the workbook and extract the hyperlinks, but openpyxl do not support .ods format.
Upvotes: -1
Views: 390
Reputation: 6630
There are some ways to convert ods
to xlsx
but most will result in the loss of the hyperlinks.
You can use libreoffice if installed to do the conversion as if you were saving the file from the application.
This example is for Windows but the same can be run on Linux by changing the libreoffice exe and paths.
The libreoffice file on windows is called 'soffice.exe' and we run it in headless mode like;
soffice --headless --invisible --convert-to xlsx hyperlink.ods
This will produce the file hyperlink.xlsx
.
In this example we are also using the param --outdir
to dump the converted file to a temp directory tempdir
.
Once the file is converted it can be opened by Pandas and Openpyxl etc.
from openpyxl import load_workbook
import pandas as pd
import subprocess
import os
### File locations and names'
tempdir = 'C:/temp'
liboffice_path = "C:/Program Files/LibreOffice/program/"
liboffice_exe = 'soffice.exe'
liboffice_args = ' --headless --invisible --convert-to xlsx --outdir '
filepath = 'C:/ConvertODS-XLSX/' ### Location of the .ods file
filename = 'hyperlink.ods' ### Name of ods file to convert
convert_to = 'xlsx' ### Format to convert ods file to
STDNULL = open(os.devnull, 'w')
libfile = os.path.join(liboffice_path, liboffice_exe)
odsfile = os.path.join(filepath, filename)
### Run the conversion
subprocess.call(f"{libfile} "
"--headless --invisible "
f"--convert-to {convert_to} "
f"--outdir {tempdir} "
f"{odsfile}",
stdout=STDNULL, stderr=STDNULL, shell=False
)
fileext = os.path.splitext(filename)[1]
converted_file = os.path.join(tempdir, filename.replace(fileext, f'.{convert_to}'))
df = pd.read_excel(converted_file)
wb = load_workbook(converted_file)
ws = wb['Sheet1']
for row in ws.iter_rows():
for cell in row:
if cell.hyperlink is not None:
print(cell.hyperlink.target)
...
Upvotes: 0
Reputation: 37827
You can try reading the .ods
file as an archive and parse its content with beautifulsoup :
from zipfile import ZipFile
from bs4 import BeautifulSoup as BS
with ZipFile("file.ods", "r") as zf:
soup = BS(zf.read("content.xml"), "xml")
data = [
(cell.get_text(), cell["xlink:href"])
for cell in soup.find_all("text:a")
]
df = pd.DataFrame(data, columns=["Name", "Url"])
An alternative, with odfpy
:
#pip install odfpy
from odf.opendocument import load
from odf import text
opendoc = load("file.ods")
data = [
(e.firstChild.data, e.getAttribute("href"))
for e in opendoc.spreadsheet.getElementsByType(text.A)
]
df = pd.DataFrame(data, columns=["Name", "Url"])
Output :
print(df)
Name Url
0 StackOverflow https://stackoverflow.com/
1 Meta https://meta.stackexchange.com/
2 GIS https://gis.stackexchange.com/
Input used (file.ods
) :
Upvotes: 2