Reputation: 4742
Many have probably experienced copying some text from Word into a website form or something, and all the quotes ('), double quotes ("), and dashes (-) get garbled. I believe the quotes are called "Smart Quotes" or "Typographer's Quotes", but I don't know the name of the dash. Is there a category that includes these characters? Are there more?
Distinguishing features of this category: the characters are reachable from a normal QWERTY keyboard, and each is easily mistaken visually for its ASCII equivalent.
This question seems to be dealing with the same issue: How do I convert Word smart quotes and em dashes in a string? Also, perhaps they are called "em dashes"?
Upvotes: 6
Views: 2600
Reputation: 1428
I have some unofficial names for this category of characters, but they all involve swear words. As far as I know, there is no official name for this category. My guess is that, if some organization were to make a category for them, it would be named something like "Characters represented by codepoints 0x0080-0x009F, which differ between Windows-1252 and the first 256 codepoints of Unicode".
Microsoft Office products (including Word) use an almost-Unicode that is identical to Unicode in all of its codepoint-to-symbol mappings except for those 0x0080-0x009F codepoints/characters. These characters - or more precisely the byte representations of these characters' codepoints - are those which "always break". Note that there are more pesky characters (in the way you've described "pesky") than the smart quotes and dash (actually dashes, more will follow), which you can see in the images or the table below. In my experience, the most conspicuous problem causers are the symbol for the horizontal ellipsis (…, as opposed to ...) and the Euro symbol (€, which will cause you more or fewer problems based on your location).
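To make the "always break" part concrete, here is a minimal Python sketch (standard library only, not taken from any of the tools mentioned in this answer). It decodes the single byte 0x93 three ways; the variable names are just for illustration.

```python
# One byte, three interpretations. 0x93 is the byte that Windows-1252
# uses for a left double quotation mark.
raw = b"\x93"

as_cp1252 = raw.decode("cp1252")    # Windows-1252: a printable smart quote
as_latin1 = raw.decode("latin-1")   # ISO-8859-1 / Unicode U+0093: a C1 control code

try:
    as_utf8 = raw.decode("utf-8")
except UnicodeDecodeError:
    as_utf8 = None                  # a lone 0x93 byte is not valid UTF-8

print(repr(as_cp1252), repr(as_latin1), as_utf8)
```

Whichever of the last two interpretations a website form picks, the smart quote is lost, which is the garbling described in the question.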
One of the software solutions for this issue, available for Python, is in the 'Unicode, Dammit!' portion of beautifulsoup4. For the pasting-from-Word issue, the detwingle method is particularly useful. (Links to references, etc., are very near the bottom of this answer.) There are similar options available in other programming languages, but I don't know which ones are now the most useful.
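If you can't pull in a library, the simplest case can be sketched with the standard library alone. This is not how detwingle works; it is a rough illustration that assumes the text was wrongly decoded as Latin-1, so the Windows-1252 punctuation ended up as C1 control characters. The helper name remap_c1_to_cp1252 is made up for this example.

```python
def remap_c1_to_cp1252(text: str) -> str:
    """Replace C1 control characters (U+0080-U+009F) with the characters
    Windows-1252 assigns to the same byte values, where one exists."""
    out = []
    for ch in text:
        code = ord(ch)
        if 0x80 <= code <= 0x9F:
            try:
                ch = bytes([code]).decode("cp1252")
            except UnicodeDecodeError:
                # Python's cp1252 codec leaves 0x81, 0x8D, 0x8F, 0x90 and 0x9D
                # unassigned; keep the original character in that case.
                pass
        out.append(ch)
    return "".join(out)


# Simulate Word-style bytes that were mistakenly decoded as Latin-1.
garbled = b"\x93Hello\x94 \x96 wait\x85".decode("latin-1")
repaired = remap_c1_to_cp1252(garbled)
print(repaired)
```

For documents that mix UTF-8 and Windows-1252 bytes in one string, this sketch is not enough, and that is the situation detwingle is meant for.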
There are a couple of extremely unofficial names - used by myself and some colleagues - for these characters, though none that I can use in polite company. Imagine kicking your shoes through the glass in your living room wall. Instead of your shoes being "breaking Windows things" ... well, let's say that this issue can be described by switching the first two words inside the quotes.
Basically, Microsoft implemented an almost-Unicode (archived) code page, Windows-1252. It is a single-byte encoding, so its byte format is not that of the UTF-8 encoding (archived) for Unicode, but its 256 byte values map to the same symbols as the first 256 Unicode codepoints, apart from a few differences in the mapping of codepoints to symbols. I suppose these differences were made in order to make some useful characters available in the extended-ASCII (some say the ANSI) codepoints, each of which fits in a single byte. (I should put some sources here, but I don't know where to find them, and I'm trying to hurry.) If a bunch of those words above don't make sense, don't worry (though a good primer is available in Joel Spolsky's The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) (archived)).
The basic point is that Windows decided to almost conform to Unicode - Microsoft Office products use an almost-Unicode. In the following image, I've taken a graphic from a DEV Community page (archived). This is a great graphic that identifies a codepoint (archived) in the Windows-1252 (archived) encoding, by taking the number in the first vertical column followed by the number in the first horizontal row. To compare with Unicode, you can put two leading zeros. For example, you can find capital A ('A') at the intersection of '4_' and '_1', so the codepoint for 'A' is 41 or 0041. The Unicode equivalent is shown in small characters underneath the symbol. For 'A', the Unicode equivalent is 0041 (see note 1). Windows's "version" of Unicode matches Unicode in all but a very few codepoints in the Latin-1 Supplement.
I've drawn a green rectangle around what, in Unicode, is called the "Latin-1 Supplement" (archived). I've put red rectangles around the Unicode equivalents for the codepoints where Windows's almost-Unicode and Unicode have different characters, i.e. where the symbol referred to by the Unicode equivalent codepoint is different from the symbol referred to by the Windows-1252 codepoint.
(archived version of this image)
It's those two [self control; don't type in a cuss word] rows ... calming myself down ... It's those two rows that cause what you variously referred to as things "getting garbled" and "always breaking". I'll refer to those two rows as the ones that make my life a pain when I'm dealing with large datasets of text files coming from different sources. Note that the more complete answer to your question, "Are there more?", is, "Yes, any of the characters/codepoints/glyphs whose Unicode equivalent has a red box around it are additional, pesky characters."
Other than those two rows, the symbols and codepoints match for Unicode and Windows's almost-Unicode. As far as I know, these mappings in the two rows are the only differences, though I'd be happy for someone to correct me if I'm wrong.
Let's see the difference of just those two rows. Those cryptic uppercase characters with dashed boxes around them are what we call the C1 control codes (archived). Control codes (archived) can be useful, like a \t tab character, or they can be things that send instructions to teletypes (archived), telling them to do things like shooting the paper out of your teletype printer (archived) or ringing a bell (archived) to get the teletype operator's attention.
(archived version of this image)
(Those images were cut and pasted from screenshots of the Wikipedia articles on Windows-1252 and Unicode.)
The different characters, meaning the ones that Windows-1252 has, include the quotation marks you (the OP) have mentioned as well as the em dash (archived) (you were right) and en dash (archived) that Microsoft Office software automatically converts your hyphen into when two words or numbers surround the hyphen.
I've put together a table showing the differences for some of the codepoints. I hope it shows enough for one to figure out what's happening. I think it covers the symbols that cause the most problems with this encoding mess.
Codepoint   | Win-1252 | Unicode   | Unicode name for
in Win-1252 | symbol   | symbol    | Win-1252 symbol
------------+----------+-----------+----------------
0080        | €        | <control> | EURO SIGN
(0081)      | <unused> | <control> | n/a
0082        | ‚        | <control> | SINGLE LOW-9 QUOTATION MARK
0083        | ƒ        | <control> | LATIN SMALL LETTER F WITH HOOK
0084        | „        | <control> | DOUBLE LOW-9 QUOTATION MARK
0085        | …        | <control> | HORIZONTAL ELLIPSIS
0086        | †        | <control> | DAGGER
0087        | ‡        | <control> | DOUBLE DAGGER
...
0091        | ‘        | <control> | LEFT SINGLE QUOTATION MARK - see note (2)
0092        | ’        | <control> | RIGHT SINGLE QUOTATION MARK - see note (2)
0093        | “        | <control> | LEFT DOUBLE QUOTATION MARK - see note (2)
0094        | ”        | <control> | RIGHT DOUBLE QUOTATION MARK - see note (2)
...
0096        | –        | <control> | EN DASH
0097        | —        | <control> | EM DASH
...
009F        | Ÿ        | <control> | LATIN CAPITAL LETTER Y WITH DIAERESIS
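If you want the rows I skipped with "...", the whole 0x80-0x9F range can be listed with a few lines of Python. This is only a sketch using the standard cp1252 codec and unicodedata module; the function name cp1252_c1_table is invented for the example, and the "<unused>" label follows the table above.

```python
import unicodedata


def cp1252_c1_table():
    """Map each byte 0x80-0x9F to (character, Unicode name) under cp1252.
    Bytes that Python's cp1252 codec leaves undefined map to None."""
    table = {}
    for byte in range(0x80, 0xA0):
        try:
            ch = bytes([byte]).decode("cp1252")
        except UnicodeDecodeError:
            table[byte] = None
        else:
            table[byte] = (ch, unicodedata.name(ch))
    return table


for byte, entry in cp1252_c1_table().items():
    if entry is None:
        print(f"{byte:04X} | <unused>")
    else:
        ch, name = entry
        # Also show where the symbol really lives in Unicode.
        print(f"{byte:04X} | {ch} | U+{ord(ch):04X} | {name}")
```

The extra U+ column shows that each of these symbols does exist in Unicode, just at a different codepoint from the one Windows-1252 uses.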
I hope this is useful to you and to others who face the same issue. I need to go outside and chop some wood before I get more angry about this issue ; )
By the way, the 'Unicode, Dammit!' (UnicodeDammit) (archived), part of beautifulsoup4 (archived), has a detwingle method (archived) that can perform decoding when, as you (the OP) said,
[You] experience copying some text from Word into a website form or something,
and things break. "Windows breaking things".
Notes:
(1) Strictly speaking, the Unicode codepoint is written U+0041, but let's not complicate things too much. We probably also shouldn't complicate things too much by saying that the CP-1252 (i.e. Windows-1252) codepoints should correctly only be written as, e.g., 41 and not 0041.
See also:
The TTY Demystified (archived) for info on the teletype / tty and on control characters.
A few links for control characters more specifically:
1 (1archived) || 2 (2archived)
|| 3 (3archived)
And finally, for a lot of detail,
last (archived)
Wikipedia articles for em (archived) and en (archived), the typographical units that are the concept on which the em- and en- dashes are based.
Upvotes: 1
Reputation: 13942
The Unicode code space holds 1,114,112 code points (U+0000 through U+10FFFF). My US-standard keyboard makes those that fall between 1 and 127 (base 10) reasonably easy to access.
When you venture beyond that range you start getting into either old style locales, or more modern UTF8 (or other Unicode) code points. Many of these code points are easily accessible from a keyboard somewhere in the world. But from the comfort of your own home or office, you'll find a fairly small subset of those 1.1 million to be easily accessible from your keyboard.
There is a Unicode property called QMark (the short name), or Quotation_Mark (the long name), that includes 29 quotation style code points (code point values, in hex): 0x0022, 0x0027, 0x00ab, 0x00bb, 0x2018, 0x2019, 0x201a, 0x201b, 0x201c, 0x201d, 0x201e, 0x201f, 0x2039, 0x203a, 0x300c, 0x300d, 0x300e, 0x300f, 0x301d, 0x301e, 0x301f, 0xfe41, 0xfe42, 0xfe43, 0xfe44, 0xff02, 0xff07, 0xff62, and 0xff63.
Here's how they look (assuming your fonts support them all):
"'«»‘’‚‛“”„‟‹›「」『』〝〞〟﹁﹂﹃﹄"'「」
There happens to be a Unicode property ASCII, which not surprisingly contains 128 code points between 0 and 127.
I can't seem to find a Unicode property that specifies "Everything that is not ASCII", but you will know it by virtue of the fact that it falls outside of the 0 .. 127 range.
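That range check is easy to apply in practice. As a rough sketch (Python here, since its unicodedata module ships with the language; the function name non_ascii_report is made up), this lists every character outside 0 .. 127 in a string along with its official Unicode name:

```python
import unicodedata


def non_ascii_report(text: str):
    """Return (index, character, Unicode name) for each character outside 0..127."""
    return [
        (i, ch, unicodedata.name(ch, "<unnamed>"))
        for i, ch in enumerate(text)
        if ord(ch) > 127
    ]


# A Word-style sentence with a smart apostrophe, smart quotes and an em dash.
sample = "It\u2019s \u201cfine\u201d \u2014 really"
for index, ch, name in non_ascii_report(sample):
    print(index, f"U+{ord(ch):04X}", name)
```

Running something like this over pasted text is a quick way to see which of the look-alike characters actually arrived.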
There is also a Hyphen Unicode property that contains eleven code points: 0x002d, 0x00ad, 0x058a, 0x1806, 0x2010, 0x2011, 0x2e17, 0x30fb, 0xfe63, 0xff0d, and 0xff65. I'm reluctant to paste them all here, as at least two of them don't render in my terminal. But here goes:
-֊᠆‐‑⸗・﹣-・
As you can see, some are indistinguishable from others. When I use the Hyphen property in Perl 5.16 I get a warning that the particular Unicode property is deprecated. That is not specific to Perl: the Unicode Standard itself has marked Hyphen as deprecated since Unicode 6.0.
There is also a Dash property containing 27 code points. I think you get the idea, so I won't enumerate them here. There is another named Dash_Punctuation with 23 code points. Note that many code points can be categorized by more than one Unicode property, so it's possible that there is overlap between Hyphen and Dash, and probably even more overlap between Dash and Dash_Punctuation -- I don't know and haven't checked.
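One part of the overlap question can be checked without Perl. Dash_Punctuation is the general category Pd, which Python's unicodedata module exposes; it does not expose the binary Dash or Hyphen properties, so the sketch below only compares Pd against the eleven Hyphen code points listed above. The count it prints depends on the Unicode version bundled with your Python, so it may not be 23.

```python
import sys
import unicodedata

# Every code point whose general category is Pd (Dash_Punctuation).
dash_punctuation = {
    chr(cp)
    for cp in range(sys.maxunicode + 1)
    if unicodedata.category(chr(cp)) == "Pd"
}

# The Hyphen code points as listed in this answer.
hyphen = {
    chr(cp)
    for cp in (0x002D, 0x00AD, 0x058A, 0x1806, 0x2010, 0x2011,
               0x2E17, 0x30FB, 0xFE63, 0xFF0D, 0xFF65)
}

both = hyphen & dash_punctuation
hyphen_only = hyphen - dash_punctuation

print("Pd code points in this Python:", len(dash_punctuation))
print("In both sets:", sorted(f"U+{ord(c):04X}" for c in both))
print("Hyphen but not Pd:", sorted(f"U+{ord(c):04X}" for c in hyphen_only))
```

So there is real overlap, but it is not complete: the soft hyphen and the two katakana middle dots are in the Hyphen list without being dash punctuation.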
I know this isn't a Perl-centric question by any means, but I've found that Perl has pretty good documentation of the Unicode properties here: perldoc perluniprops.
So I guess the short answer to the question, "Are there more?" is yes, there are about 1.1 million more.
Update: Regarding what these pesky characters are called.... You sort of have to differentiate between code points and glyphs. A code point is the unambiguous representation of a Unicode entity, whereas the glyph is what it looks like. Different fonts may implement a given glyph differently from each other. So what looks the same in one font may look a little different in another. Start thinking of Unicode code points, and their associated full names as having semantic meaning, whereas glyphs are simple graphical (unreliable) representations.
Update 2: In some programming languages (specifically Perl, but possibly others) you may create custom character classes using set logic. In Perl, these are referred to as Extended Bracketed Character Classes, and are discussed in perldoc perlrecharclass. If you wanted to match all quotes that are not within the ASCII range, you could use this subexpression:
(?[\p{QMark}-\p{ASCII}])
The subexpression above creates a character class that matches all quote-like marks excluding those that come from the ASCII range. This is a feature that was introduced to Perl in Perl version 5.18. Given that this "Update 2" was added in 2019, and Perl 5.18 was released in 2013, the feature has been available for roughly six years. Unfortunately I find no indication that it has found its way into the PCRE libraries outside of Perl.
Though it has been around for six years already, this feature (as of Perl 5.28) is still marked 'experimental'. Therefore, to use it you should add the following pragma in the scope where it is used:
no warnings qw(experimental::regex_sets);
That will squelch the experimental warning. I would not be surprised to see that warning lifted in a near-future release of Perl.
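For languages without set logic in their regex engine, the same class can be built by hand. Here is a rough Python equivalent, assuming the 29 Quotation_Mark code points listed earlier in this answer are still the full set for your Unicode version (Python's re module has no \p{...} support, so the list is hard-coded; the names QMARK and pattern are just for illustration).

```python
import re

# The 29 Quotation_Mark code points listed earlier in this answer.
QMARK = [
    0x0022, 0x0027, 0x00AB, 0x00BB, 0x2018, 0x2019, 0x201A, 0x201B,
    0x201C, 0x201D, 0x201E, 0x201F, 0x2039, 0x203A, 0x300C, 0x300D,
    0x300E, 0x300F, 0x301D, 0x301E, 0x301F, 0xFE41, 0xFE42, 0xFE43,
    0xFE44, 0xFF02, 0xFF07, 0xFF62, 0xFF63,
]

# Set difference done by hand: quotation marks minus the ASCII range.
non_ascii_quotes = [cp for cp in QMARK if cp > 0x7F]

# Build a plain character class from what is left.
pattern = re.compile(
    "[" + "".join(re.escape(chr(cp)) for cp in non_ascii_quotes) + "]"
)

text = 'He said \u201chi\u201d and "bye"'
matches = pattern.findall(text)
print(matches)
```

Only the two smart quotes are matched; the straight ASCII quotes around "bye" are left alone, which is the behaviour of the Perl subexpression above.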
Upvotes: 6