Reputation: 3249
Since I am working with many different fonts, each of which treats these symbols differently, I would like to standardize all quotation marks and apostrophes in my texts.
I'm looking for something similar to this substitution for line breaks
content=re.sub(r'\u000D\u000A|[\u000A\u000B\u000C\u000D\u0085\u2028\u2029]', '\n', content)
or for hyphens
content = regex.sub(r'\p{Pd}+', '-', content)
Can you help me?
Upvotes: 2
Views: 2231
Reputation:
Note that these categories are subjective.
For example, there is no single Unicode property for Single Quote or Double Quote that will give you exactly the set you're looking for.
But you could play around with subsets, for example
\p{Block=General_Punctuation}(?<=\p{Quotation_Mark})
will give the subset of these ‘’‚‛“”„‟‹›
Whereas using just \p{Quotation_Mark}
will give this subset "'«»‘’‚‛“”„‟‹›⹂「」『』〝〞〟﹁﹂﹃﹄＂＇｢｣
where some might be questionable quotation marks.
Here's another one \p{Line_Break=Quotation}
which gives these "'«»‘’‛“”‟‹›❛❜❝❞❟❠⸀⸁⸂⸃⸄⸅⸆⸇⸈⸉⸊⸋⸌⸍⸜⸝⸠⸡🙶🙷🙸
So, be warned, there is no definitive SET according to Unicode specs.
For the hyphen, the explicit equivalent of \p{Pd} would probably be the following regex (the trailing \uD803\uDEAD is the UTF-16 surrogate pair for U+10EAD)
find: (?:[\u002D\u058A\u05BE\u1400\u1806\u2010-\u2015\u2E17\u2E1A\u2E3A\u2E3B\u2E40\u301C\u3030\u30A0\uFE31\uFE32\uFE58\uFE63\uFF0D]|\uD803\uDEAD)
replace: -
And for the single quote
find: [\u0060\u00B4\u2018\u2019]
replace: '
And for the double quote
find: [\u201C\u201D]
replace: "
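As a rough illustration, the three find/replace pairs above could be applied in Python with the standard `re` module. This is only a sketch: the function name `normalize_punctuation` is made up for the example, and U+10EAD is written as `\U00010EAD` because Python patterns work on code points rather than UTF-16 surrogate pairs.

```python
import re

# Character classes copied from the find/replace pairs above.
# \U00010EAD is the code point encoded by the \uD803\uDEAD surrogate pair.
HYPHENS = re.compile(
    r"[\u002D\u058A\u05BE\u1400\u1806\u2010-\u2015\u2E17\u2E1A\u2E3A\u2E3B"
    r"\u2E40\u301C\u3030\u30A0\uFE31\uFE32\uFE58\uFE63\uFF0D\U00010EAD]"
)
SINGLE_QUOTES = re.compile(r"[\u0060\u00B4\u2018\u2019]")
DOUBLE_QUOTES = re.compile(r"[\u201C\u201D]")


def normalize_punctuation(text: str) -> str:
    """Replace each matched character with its ASCII counterpart (hypothetical helper)."""
    text = HYPHENS.sub("-", text)
    text = SINGLE_QUOTES.sub("'", text)
    return DOUBLE_QUOTES.sub('"', text)


if __name__ == "__main__":
    print(normalize_punctuation("\u201Cit\u2019s\u201D \u2014 ok"))
```

Each match is replaced one-for-one here; adding `+` after a class would collapse runs instead, as the `\p{Pd}+` pattern in the question does.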
Also note that every character is matched by many Unicode properties, so if you traverse a sample string you can see how the properties overlap.
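One way to look at a sample string character by character is the standard `unicodedata` module. This is a minimal sketch, and it only shows the general category and the name; as far as I know `unicodedata` does not expose binary properties such as Quotation_Mark. The helper name `describe` is just for illustration.

```python
import unicodedata


def describe(text):
    """Return (code point, general category, name) for each character (hypothetical helper)."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]


if __name__ == "__main__":
    # Sample: apostrophe, right single quote, left guillemet, hyphen-minus, em dash.
    for row in describe("'\u2019\u00AB-\u2014"):
        print(*row)
```

Quotation-like characters end up in several different categories (Po, Pi, Pf and others), which is part of why no single property covers them all.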
Upvotes: 1
Reputation: 626896
If you use the Uniview tool, you can search for all Unicode symbols whose entries mention "single quotation mark", "double quotation mark", or "apostrophe".
Here are somewhat pruned outputs:
Single quotation marks, [\u02BB\u02BC\u066C\u2018-\u201A\u275B\u275C] (see demo):
ʻ - 02BB MODIFIER LETTER TURNED COMMA
ʼ - 02BC MODIFIER LETTER APOSTROPHE
٬ - 066C ARABIC THOUSANDS SEPARATOR
‘ - 2018 LEFT SINGLE QUOTATION MARK
’ - 2019 RIGHT SINGLE QUOTATION MARK
‚ - 201A SINGLE LOW-9 QUOTATION MARK
❛ - 275B HEAVY SINGLE TURNED COMMA QUOTATION MARK ORNAMENT
❜ - 275C HEAVY SINGLE COMMA QUOTATION MARK ORNAMENT
Double quotation marks, [\u201C-\u201E\u2033\u275D\u275E\u301D\u301E] (see demo):
“ - 201C LEFT DOUBLE QUOTATION MARK
” - 201D RIGHT DOUBLE QUOTATION MARK
„ - 201E DOUBLE LOW-9 QUOTATION MARK
″ - 2033 DOUBLE PRIME
❝ - 275D HEAVY DOUBLE TURNED COMMA QUOTATION MARK ORNAMENT
❞ - 275E HEAVY DOUBLE COMMA QUOTATION MARK ORNAMENT
〝 - 301D REVERSED DOUBLE PRIME QUOTATION MARK
〞 - 301E DOUBLE PRIME QUOTATION MARK
Apostrophes, [\u0027\u02B9\u02BB\u02BC\u02BE\u02C8\u02EE\u0301\u0313\u0315\u055A\u05F3\u07F4\u07F5\u1FBF\u2018\u2019\u2032\uA78C\uFF07] (see demo):
' - 0027 APOSTROPHE
ʹ - 02B9 MODIFIER LETTER PRIME
ʻ - 02BB MODIFIER LETTER TURNED COMMA
ʼ - 02BC MODIFIER LETTER APOSTROPHE
ʾ - 02BE MODIFIER LETTER RIGHT HALF RING
ˈ - 02C8 MODIFIER LETTER VERTICAL LINE
ˮ - 02EE MODIFIER LETTER DOUBLE APOSTROPHE
◌́ - 0301 COMBINING ACUTE ACCENT
◌̓ - 0313 COMBINING COMMA ABOVE
◌̕ - 0315 COMBINING COMMA ABOVE RIGHT
՚ - 055A ARMENIAN APOSTROPHE
׳ - 05F3 HEBREW PUNCTUATION GERESH
ߴ - 07F4 NKO HIGH TONE APOSTROPHE
ߵ - 07F5 NKO LOW TONE APOSTROPHE
᾿ - 1FBF GREEK PSILI
‘ - 2018 LEFT SINGLE QUOTATION MARK
’ - 2019 RIGHT SINGLE QUOTATION MARK
′ - 2032 PRIME
ꞌ - A78C LATIN SMALL LETTER SALTILLO
＇ - FF07 FULLWIDTH APOSTROPHE
Upvotes: 6
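The three character classes from the answer above could be folded into a single translation table, roughly as sketched below. The names `SKIP`, `TABLE`, and `straighten_quotes` are made up for this example. Leaving out the combining marks (U+0301, U+0313, U+0315) is my own assumption, not part of the answer: replacing them would break decomposed accented letters such as "e" followed by U+0301.

```python
# Code points taken from the three character classes listed in the answer above.
SINGLE = "\u02BB\u02BC\u066C\u2018\u2019\u201A\u275B\u275C"
DOUBLE = "\u201C\u201D\u201E\u2033\u275D\u275E\u301D\u301E"
APOSTROPHES = (
    "\u0027\u02B9\u02BB\u02BC\u02BE\u02C8\u02EE\u0301\u0313\u0315"
    "\u055A\u05F3\u07F4\u07F5\u1FBF\u2018\u2019\u2032\uA78C\uFF07"
)

# Assumption (not from the answer): combining marks are left untouched.
SKIP = "\u0301\u0313\u0315"

# Single quotes and apostrophes both map to U+0027, double quotes to U+0022.
TABLE = {ord(ch): "'" for ch in SINGLE + APOSTROPHES if ch not in SKIP}
TABLE.update({ord(ch): '"' for ch in DOUBLE})


def straighten_quotes(text: str) -> str:
    """Map every listed quote or apostrophe variant to its ASCII form (hypothetical helper)."""
    return text.translate(TABLE)


if __name__ == "__main__":
    print(straighten_quotes("\u201EHi\u201C, it\u2019s"))
```

`str.translate` handles all replacements in one pass. The lists are deliberately broad, so entries such as ARABIC THOUSANDS SEPARATOR or PRIME may need pruning for a given corpus.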