Reputation: 1739
We use an OCR-like PDF-to-Word converter, the best we could find, but it emits text in the Symbol font encoding. The degree symbol, for example, appears as the code point U+F0B0, which sits in the Unicode Private Use Area and is not an assigned character, although it corresponds to the proper Unicode degree sign U+00B0. In fact, all but one of the Symbol font glyphs have a proper Unicode character, but I am pulling my hair out because I cannot find any table that shows this simple mapping.
This page, http://www.alanwood.net/demos/symbol.html, almost has it, but it does not show the Symbol font code points I am seeing; it relies on some other mapping which, frankly, I don't understand at all. That same site has related pages, but nowhere do I find F0B0 referenced for the degree sign.
The best I could find are the groff mappings of these special fonts to the old groff abbreviations. In symbol.map I can find a mapping of F0B0 to the abbreviation "de", and in text.map a mapping from 00B0 to "de". So if I were to reshape these two files into relational tables and join them on the abbreviation, I suppose I could create a mapping.
But I am stunned that nobody has had to do this before. Anyone?
Upvotes: 2
Views: 824
Reputation: 1739
Ah well, I guess I didn't ask for a dissertation on the first principles of all possible symbolic fonts. No, I asked about whatever that Windows "Symbol" font is: that WGL4 code page or, I suppose, the "Monotype Symbol" font.
So here is what I did to generate the mapping from these groff font abbreviation maps I pointed to in my question:
wget https://opensource.apple.com/source/groff/groff-39/groff/font/devlj4/generate/symbol.map
sed -e '/^#/d' -e '/^ *$/d' -e 's/[\t ][\t ]*/|/g' symbol.map |cut -d\| -f2,3 |sort -t\| -k2 >symbol.map.dat
wget https://opensource.apple.com/source/groff/groff-39/groff/font/devlj4/generate/text.map
sed -e '/^#/d' -e '/^ *$/d' -e 's/[\t ][\t ]*/|/g' text.map |cut -d\| -f2,3 |sort -t\| -k2 >text.map.dat
wget https://opensource.apple.com/source/groff/groff-39/groff/font/devlj4/generate/special.map
sed -e '/^#/d' -e '/^ *$/d' -e 's/[\t ][\t ]*/|/g' special.map |cut -d\| -f2,3 |sort -t\| -k2 >special.map.dat
cat text.map.dat special.map.dat |sort -t\| -k2 > unicode.map.dat
join -t\| -1 2 -2 2 symbol.map.dat unicode.map.dat
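To make the shape of that join concrete, here is a minimal sketch on two hand-written sample rows instead of the downloaded files. The `*.sample.dat` file names and the temporary directory are made up for illustration. The degree pair (F0B0 and 00B0, both abbreviated "de") is the one from my question; the alpha pair is my own pick for a second row.

```shell
# Hand-written sample rows in the same code|abbreviation shape
# as symbol.map.dat and unicode.map.dat (not the real files).
workdir=$(mktemp -d)

# Symbol-font side: private-use code point | groff abbreviation
printf '%s\n' 'F0B0|de' 'F061|*a' |
  LC_ALL=C sort -t'|' -k2 > "$workdir/symbol.sample.dat"

# Unicode side: proper code point | groff abbreviation
printf '%s\n' '00B0|de' '03B1|*a' |
  LC_ALL=C sort -t'|' -k2 > "$workdir/unicode.sample.dat"

# Join on field 2 (the abbreviation) of both files.
# Output columns: abbreviation | symbol code | unicode code
joined=$(LC_ALL=C join -t'|' -1 2 -2 2 \
  "$workdir/symbol.sample.dat" "$workdir/unicode.sample.dat")

printf '%s\n' "$joined"
rm -rf "$workdir"
```

The sketch pins `LC_ALL=C` on both `sort` and `join` because `join` expects its inputs ordered by the same collation it uses itself; if the two disagree, rows can be dropped silently.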
Then I create an XML mapping table from that, which I use in my XSLT:
join -t\| -1 2 -2 2 symbol.map.dat unicode.map.dat |sed -e 's~\([^|]*\)|\([^|]*\)|\([^|]*\)~<a abb="\1" sym="\&#x\2;" uni="\&#x\3;"/>~'
This creates:
<a abb="" sym="" uni="≃"/>
<a abb="!" sym="" uni="!"/>
<a abb="!=" sym="" uni="≠"/>
<a abb="#" sym="" uni="#"/>
<a abb="%" sym="" uni="%"/>
<a abb="&" sym="" uni="&"/>
<a abb="(" sym="" uni="("/>
<a abb=")" sym="" uni=")"/>
<a abb="**" sym="" uni="∗"/>
<a abb="*A" sym="" uni="A"/>
<a abb="*B" sym="" uni="B"/>
<a abb="*C" sym="" uni="Ξ"/>
<a abb="*D" sym="" uni="∆"/>
<a abb="*E" sym="" uni="E"/>
<a abb="*F" sym="" uni="Φ"/>
<a abb="*G" sym="" uni="Γ"/>
<a abb="*H" sym="" uni="Θ"/>
<a abb="*I" sym="" uni="I"/>
<a abb="*K" sym="" uni="K"/>
<a abb="*L" sym="" uni="Λ"/>
<a abb="*M" sym="" uni="M"/>
<a abb="*N" sym="" uni="N"/>
<a abb="*O" sym="" uni="O"/>
<a abb="*P" sym="" uni="Π"/>
<a abb="*Q" sym="" uni="Ψ"/>
<a abb="*R" sym="" uni="P"/>
<a abb="*S" sym="" uni="Σ"/>
<a abb="*T" sym="" uni="T"/>
<a abb="*U" sym="" uni="Υ"/>
<a abb="*W" sym="" uni="Ω"/>
<a abb="*X" sym="" uni="X"/>
<a abb="*Y" sym="" uni="H"/>
<a abb="*Z" sym="" uni="Z"/>
<a abb="*a" sym="" uni="α"/>
<a abb="*b" sym="" uni="β"/>
<a abb="*c" sym="" uni="ξ"/>
<a abb="*d" sym="" uni="δ"/>
<a abb="*e" sym="" uni=""/>
<a abb="*f" sym="" uni="φ"/>
<a abb="*g" sym="" uni="γ"/>
<a abb="*h" sym="" uni="θ"/>
<a abb="*i" sym="" uni="ι"/>
<a abb="*k" sym="" uni="κ"/>
<a abb="*l" sym="" uni="λ"/>
<a abb="*m" sym="" uni="μ"/>
<a abb="*n" sym="" uni="ν"/>
<a abb="*o" sym="" uni="ο"/>
<a abb="*p" sym="" uni="π"/>
<a abb="*q" sym="" uni="ψ"/>
<a abb="*r" sym="" uni="ρ"/>
<a abb="*s" sym="" uni="σ"/>
<a abb="*t" sym="" uni="τ"/>
<a abb="*u" sym="" uni="υ"/>
<a abb="*w" sym="" uni="ω"/>
<a abb="*x" sym="" uni="χ"/>
<a abb="*y" sym="" uni="η"/>
<a abb="*z" sym="" uni="ζ"/>
<a abb="+-" sym="" uni="±"/>
<a abb="+f" sym="" uni="ϕ"/>
<a abb="+h" sym="" uni="ϑ"/>
<a abb="+p" sym="" uni="ϖ"/>
<a abb="," sym="" uni=","/>
<a abb="->" sym="" uni="→"/>
<a abb="." sym="" uni="."/>
<a abb="/" sym="" uni="/"/>
<a abb="/_" sym="" uni="∠"/>
<a abb="0" sym="" uni="0"/>
<a abb="1" sym="" uni="1"/>
<a abb="2" sym="" uni="2"/>
<a abb="3" sym="" uni="3"/>
<a abb="3d" sym="" uni="∴"/>
<a abb="4" sym="" uni="4"/>
<a abb="5" sym="" uni="5"/>
<a abb="6" sym="" uni="6"/>
<a abb="7" sym="" uni="7"/>
<a abb="8" sym="" uni="8"/>
<a abb="9" sym="" uni="9"/>
<a abb=":" sym="" uni=":"/>
<a abb=";" sym="" uni=";"/>
<a abb="<" sym="" uni="<"/>
<a abb="<-" sym="" uni="←"/>
<a abb="<=" sym="" uni="≤"/>
<a abb="<>" sym="" uni="↔"/>
<a abb="=" sym="" uni="="/>
<a abb="==" sym="" uni="≡"/>
<a abb="=~" sym="" uni="≅"/>
<a abb=">" sym="" uni=">"/>
<a abb=">=" sym="" uni="≥"/>
<a abb="?" sym="" uni="?"/>
<a abb="AN" sym="" uni="∧"/>
<a abb="Ah" sym="" uni="ℵ"/>
<a abb="CL" sym="" uni="♣"/>
<a abb="CR" sym="" uni="↵"/>
<a abb="DI" sym="" uni="♦"/>
<a abb="Eu" sym="" uni="€"/>
<a abb="HE" sym="" uni="♥"/>
<a abb="Im" sym="" uni="ℑ"/>
<a abb="OR" sym="" uni="∨"/>
<a abb="Re" sym="" uni="ℜ"/>
<a abb="SP" sym="" uni="♠"/>
<a abb="[" sym="" uni="["/>
<a abb="]" sym="" uni="]"/>
<a abb="_" sym="" uni="_"/>
<a abb="ap" sym="" uni="~"/>
<a abb="arrowvertbt" sym="" uni="⇓"/>
<a abb="arrowverttp" sym="" uni="⇑"/>
<a abb="c*" sym="" uni="⊗"/>
<a abb="c+" sym="" uni="⊕"/>
<a abb="ca" sym="" uni="∩"/>
<a abb="cu" sym="" uni="∪"/>
<a abb="da" sym="" uni="↓"/>
<a abb="de" sym="" uni="°"/>
<a abb="di" sym="" uni="÷"/>
<a abb="es" sym="" uni="∅"/>
<a abb="f/" sym="" uni="∕"/>
<a abb="fa" sym="" uni="∀"/>
<a abb="fm" sym="" uni="′"/>
<a abb="gr" sym="" uni="∇"/>
<a abb="hA" sym="" uni="⇔"/>
<a abb="ib" sym="" uni="⊆"/>
<a abb="if" sym="" uni="∞"/>
<a abb="integral" sym="" uni="∫"/>
<a abb="ip" sym="" uni="⊇"/>
<a abb="lA" sym="" uni="⇐"/>
<a abb="la" sym="" uni="〈"/>
<a abb="lz" sym="" uni="◇"/>
<a abb="mi" sym="" uni="−"/>
<a abb="mo" sym="" uni="∈"/>
<a abb="mu" sym="" uni="×"/>
<a abb="nb" sym="" uni="⊄"/>
<a abb="nm" sym="" uni="∉"/>
<a abb="no" sym="" uni="¬"/>
<a abb="pd" sym="" uni="∂"/>
<a abb="pl" sym="" uni="+"/>
<a abb="pp" sym="" uni="⊥"/>
<a abb="product" sym="" uni="∏"/>
<a abb="pt" sym="" uni="∝"/>
<a abb="rA" sym="" uni="⇒"/>
<a abb="ra" sym="" uni="〉"/>
<a abb="sb" sym="" uni="⊂"/>
<a abb="sd" sym="" uni="″"/>
<a abb="sp" sym="" uni="⊃"/>
<a abb="st" sym="" uni="∍"/>
<a abb="sum" sym="" uni="∑"/>
<a abb="te" sym="" uni="∃"/>
<a abb="ts" sym="" uni="ς"/>
<a abb="u2026" sym="" uni="…"/>
<a abb="u2320" sym="" uni="⌠"/>
<a abb="u2321" sym="" uni="⌡"/>
<a abb="ua" sym="" uni="↑"/>
<a abb="wp" sym="" uni="℘"/>
<a abb="~=" sym="" uni="≈"/>
Alternatively, I can create one lookup string of these private-use code points and a second string holding the proper Unicode character at each matching position:
join -t\| -1 2 -2 2 symbol.map.dat unicode.map.dat |sed -e 's~\([^|]*\)|\([^|]*\)|\([^|]*\)~\1|\&#x\2;|\&#x\3;~' > symbol-unicode.map.dat
echo '<a sym="'$(cut -d\| -f2 symbol-unicode.map.dat |tr -d '\n')'" uni="'$(cut -d\| -f3 symbol-unicode.map.dat |tr -d '\n')'"/>'
which gives me:
<a sym=""
uni="≃!≠#%&()∗ABΞ∆EΦΓΘIKΛMNOΠΨPΣTΥΩXHZαβξδφγθικλμνοπψρστυωχηζ±ϕϑϖ,→./∠0123∴456789:;<←≤↔=≡≅>≥?∧ℵ♣↵♦€♥ℑ∨ℜ♠[]_~⇓⇑⊗⊕∩∪↓°÷∅∕∀′∇⇔⊆∞∫⊇⇐〈◇−∈×⊄∉¬∂+⊥∏∝⇒〉⊂″⊃∍∑∃ς…⌠⌡↑℘≈">
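For anyone who wants to apply such a mapping outside of XSLT, here is a minimal shell sketch of the replacement step, assuming UTF-8 input. It covers only two hand-picked pairs (degree and alpha) rather than the whole table, and the function name `fix_symbols` is made up for illustration. The characters are written as octal UTF-8 bytes so the snippet does not depend on how the private-use characters survive copy and paste.

```shell
# Two hand-picked pairs, spelled out as UTF-8 bytes in octal.
sym_deg=$(printf '\357\202\260')    # U+F0B0, Symbol-font degree (private use)
uni_deg=$(printf '\302\260')        # U+00B0, DEGREE SIGN
sym_alpha=$(printf '\357\201\241')  # U+F061, Symbol-font alpha (private use)
uni_alpha=$(printf '\316\261')      # U+03B1, GREEK SMALL LETTER ALPHA

# Hypothetical helper: replaces the private-use sequences on stdin.
# LC_ALL=C makes sed match plain byte sequences.
fix_symbols() {
  LC_ALL=C sed -e "s/$sym_deg/$uni_deg/g" \
               -e "s/$sym_alpha/$uni_alpha/g"
}

# Sample input containing both private-use characters.
fixed=$(printf '%s = 25 %sC\n' "$sym_alpha" "$sym_deg" | fix_symbols)
printf '%s\n' "$fixed"
```

A full version would need one substitution per row of the mapping; this sketch only shows the mechanism.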
By the way, a funny thing about the Stack Exchange platform is that I can show you the symbols right here. First the bad ones, which will probably all show up as boxes unless you tweak your local CSS with style="font-family: 'Symbol';"
:
and now the Unicode string:
≃!≠#%&()∗ABΞ∆EΦΓΘIKΛMNOΠΨPΣTΥΩXHZαβξδφγθικλμνοπψρστυωχηζ±ϕϑϖ,→./∠0123∴456789:;<←≤↔=≡≅>≥?∧ℵ♣↵♦€♥ℑ∨ℜ♠[]_~⇓⇑⊗⊕∩∪↓°÷∅∕∀′∇⇔⊆∞∫⊇⇐〈◇−∈×⊄∉¬∂+⊥∏∝⇒〉⊂″⊃∍∑∃ς…⌠⌡↑℘≈
Pretty neat that.
Perhaps it can help someone else struggling with the same issue needing a quick practical solution. You're welcome.
Upvotes: 2