Reputation: 1

How to read the hyperlinks from an .ods file?

I have a .ods file, which looks pretty similar to a usual excel file. This file contains a column with hyperlinks in it. To parse the full document, I've been using the pandas read_excel() method and works perfectly well for the raw data, but the hyperlinks are lost.

Do anyone know of a solution to parse hyperlinks from an .ods file?

Looking on the web, users propose to use openpyxl to load the workbook and extract the hyperlinks, but openpyxl do not support .ods format.

Upvotes: -1

Answers (2)

moken

Reputation: 6630

There are some ways to convert ods to xlsx but most will result in the loss of the hyperlinks.
You can use libreoffice if installed to do the conversion as if you were saving the file from the application.

This example is for Windows but the same can be run on Linux by changing the libreoffice exe and paths.
The libreoffice file on windows is called 'soffice.exe' and we run it in headless mode like;

soffice --headless --invisible --convert-to xlsx hyperlink.ods

This will produce the file hyperlink.xlsx.
In this example we are also using the param --outdir to dump the converted file to a temp directory tempdir.

Once the file is converted it can be opened by Pandas and Openpyxl etc.

from openpyxl import load_workbook
import pandas as pd
import subprocess
import os

### File locations and names'
tempdir = 'C:/temp'
liboffice_path = "C:/Program Files/LibreOffice/program/"
liboffice_exe = 'soffice.exe'
liboffice_args = ' --headless --invisible --convert-to xlsx --outdir '

filepath = 'C:/ConvertODS-XLSX/' ### Location of the .ods file 
filename = 'hyperlink.ods'  ### Name of ods file to convert
convert_to = 'xlsx'  ### Format to convert ods file to

STDNULL = open(os.devnull, 'w')

libfile = os.path.join(liboffice_path, liboffice_exe)
odsfile = os.path.join(filepath, filename)

### Run the conversion 
subprocess.call(f"{libfile} "
                "--headless --invisible "
                f"--convert-to {convert_to} "
                f"--outdir {tempdir} "
                f"{odsfile}",
                stdout=STDNULL, stderr=STDNULL, shell=False
                )

fileext = os.path.splitext(filename)[1]
converted_file = os.path.join(tempdir, filename.replace(fileext, f'.{convert_to}'))

df = pd.read_excel(converted_file)

wb = load_workbook(converted_file)
ws = wb['Sheet1']

for row in ws.iter_rows():
    for cell in row:
        if cell.hyperlink is not None:
            print(cell.hyperlink.target)
...

Upvotes: 0

Timeless

Reputation: 37827

You can try reading the .ods file as an archive and parse its content with beautifulsoup :

from zipfile import ZipFile
from bs4 import BeautifulSoup as BS

with ZipFile("file.ods", "r") as zf:
  soup = BS(zf.read("content.xml"), "xml")

data = [
    (cell.get_text(), cell["xlink:href"])
    for cell in soup.find_all("text:a")
]

df = pd.DataFrame(data, columns=["Name", "Url"])

An alternative, with odfpy :

#pip install odfpy
from odf.opendocument import load
from odf import text

opendoc = load("file.ods")

data = [
    (e.firstChild.data, e.getAttribute("href"))
    for e in opendoc.spreadsheet.getElementsByType(text.A)
]

df = pd.DataFrame(data, columns=["Name", "Url"])

Output :

print(df)

            Name                              Url
0  StackOverflow       https://stackoverflow.com/
1           Meta  https://meta.stackexchange.com/
2            GIS   https://gis.stackexchange.com/

Input used (file.ods) :

Upvotes: 2

How to read the hyperlinks from an .ods file?

Answers (2)

Related Questions