Powerriegel
Powerriegel

Reputation: 627

Simplest way to convert subscript numbers

we get book titles from different sources (library systems) (with possibly different encoding, but mostly utf8). These strings are shown in the web and via export to Endnote and RefWorks. RefWorks (windows Quotation system) does not accept any other encoding than ANSI.

In the RIS/Refworks export, activating the line

$smarty = iconv("UTF-8", "Windows-1252", $smarty);

Example string

Diphosphen-komplexes (CO) 5CrPhPPPhCr(CO) 5

does suddenly cut off everything after the first subscript char (the rectangles). These chars are also not correctly printed in HTML but this output is okay because nothing is cut off. In UTF-8 export file encoding nothing is cut off, too. Despite that, the Windows software can't read UTF-8.

The simplest solution would be to convert any subscript number to a regular number. Everything would work quite well then. But I could not find any simple solution to this. Working with hex codes is the only thing I could imagine. This solutions is also preferred for use in our Solr index.

Anybody knows a better solutions?

Upvotes: 0

Views: 690

Answers (1)

Jukka K. Korpela
Jukka K. Korpela

Reputation: 201728

The example string contains Private Use code points such as U+E5F8. By definition, no standard assigns any meaning to them; their use is purely by private agreements. It is thus impossible to convert them to anything, or to do anything with them, without knowing or inferring the private agreements involved. Some systems use Private Use code points to represent some symbols that are assigned to those points in some special font. Knowing what that font is and inspecting it may thus help to find out the agreement.

The conversion would need to be coded separately, in an ad hoc manner, since there is an an hoc agreement involved.

“ANSI”, which here means windows-1252, does not contain any subscript characters. In the context of a chemical formula, replacing subscript digits by normal digits does not change the meaning, and the formula is understandable, though it looks unprofessional.

When converting to HTML format (or other rich text format), you can use normal digits wrapped in elements that cause subscript rendering (or otherwise style them). HTML has the sub element for this, but its implementations differ between browsers and tend to be a poor quality, so a better approach is to generate <span class=sub>...</span> and use CSS to set the vertical position and font size.

Upvotes: 1

Related Questions