Jason

Reputation: 21

Convert ALTO XML to formatted PDF/RTF/TXT?

I am looking to batch convert a large number of ALTO XML documents to various formats on Windows: TXT at least, RTF if possible, and PDF would be convenient as well.

ALTO is an XML standard used by libraries and archives to hold metadata-, format-, font- and layout-aware text for reconstruction in PDF images.

I have only the XML files for a large archive that I would like to convert for text mining. The software I am using requires clean text or RTF files, so converting the XML to plain text is the main goal. Because ALTO is a standard, the conversion should be possible, no?

A bonus would be the ability to either embed the metadata in a PDF or convert it to a bibliographical format file like LaTeX. This could be a separate program.

I'd appreciate any ideas,

Thanks.

Upvotes: 2

Views: 1938

Answers (1)

cneud

Reputation: 11

To get plain text from the ALTO XML, you could reimplement in Java the simple method used in this (hacky) Python script: https://github.com/cneud/alto-ocr-text.
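For illustration, here is a rough Python sketch of the general idea. It is not the code from the linked script. It assumes the usual ALTO layout, where each TextBlock holds TextLine elements whose String children carry the word in a CONTENT attribute. The names text_from_root and convert_folder are made up for this example, and hyphenated words split across lines (HYP elements) are not rejoined.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def local_name(tag):
    """Drop the '{namespace}' prefix so any ALTO schema version matches."""
    return tag.rsplit("}", 1)[-1]


def text_from_root(root):
    """Return the page text: one output line per TextLine, a blank line between TextBlocks."""
    blocks = []
    for block in root.iter():
        if local_name(block.tag) != "TextBlock":
            continue
        lines = []
        for line in block.iter():
            if local_name(line.tag) != "TextLine":
                continue
            # Each recognised word is a String element; the text is in CONTENT.
            words = [
                child.get("CONTENT", "")
                for child in line
                if local_name(child.tag) == "String"
            ]
            lines.append(" ".join(words))
        blocks.append("\n".join(lines))
    return "\n\n".join(blocks)


def convert_folder(src_dir, dst_dir):
    """Hypothetical batch helper: write one .txt file per .xml file in src_dir."""
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    for xml_path in sorted(Path(src_dir).glob("*.xml")):
        root = ET.parse(xml_path).getroot()
        out_path = dst / (xml_path.stem + ".txt")
        out_path.write_text(text_from_root(root), encoding="utf-8")
```

Calling convert_folder with an input and an output folder would process a whole directory in one go. Only the standard library is used, so it should run the same way on Windows.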

I am not currently aware of a direct conversion to PDF or LaTeX, though you may be able to do this with a stylesheet, depending on what exactly your ALTO files look like.
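RTF, which the question also asks for, is simple enough that plain text can be wrapped by hand once it has been extracted. The following is only a minimal sketch under that assumption: rtf_escape and text_to_rtf are hypothetical names, the font and size are arbitrary choices, and none of the ALTO layout or font information is carried over.

```python
import struct


def rtf_escape(text):
    """Escape plain text for use inside an RTF document body."""
    out = []
    for ch in text:
        if ch in "\\{}":
            # Backslash and braces are RTF control characters.
            out.append("\\" + ch)
        elif ch == "\n":
            out.append("\\par\n")
        elif ord(ch) < 128:
            out.append(ch)
        else:
            # Non-ASCII: emit \uN? per UTF-16 code unit, N as a signed 16-bit value.
            raw = ch.encode("utf-16-le")
            for unit in struct.unpack("<%dH" % (len(raw) // 2), raw):
                signed = unit - 65536 if unit > 32767 else unit
                out.append("\\u%d?" % signed)
    return "".join(out)


def text_to_rtf(text):
    """Wrap escaped text in a minimal RTF header with a single font."""
    header = "{\\rtf1\\ansi\\deff0{\\fonttbl{\\f0 Times New Roman;}}\\f0\\fs24 "
    return header + rtf_escape(text) + "}"
```

The result is pure ASCII, so it can be written to a .rtf file with any ASCII-compatible encoding.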

Upvotes: 1
