DanielTheRocketMan

Reputation: 3249

Standardizing double quotes, single quotes and apostrophes in Python

Since I am working with many different fonts and have a special treatment for each of these symbols, I would like to standardize all quote and apostrophe characters in my texts.

I'm looking for something similar to this substitution for line breaks

import re
content = re.sub(r'\u000D\u000A|[\u000A\u000B\u000C\u000D\u0085\u2028\u2029]', '\n', content)

or for hyphens

import regex  # third-party module; the built-in re does not support \p{...}
content = regex.sub(r'\p{Pd}+', '-', content)

Can you help me?

Upvotes: 2

Views: 2231

Answers (2)

user13843220

Reputation:

Note that these categories are subjective.
For example, there is no single Unicode property for Single Quote
or Double Quote that will give you the span you're looking for.
But, you could play around with subsets, for example

\p{Block=General_Punctuation}(?<=\p{Quotation_Mark}) will give the subset of these ‘’‚‛“”„‟‹›

Whereas using just \p{Quotation_Mark}
will give this subset "'«»‘’‚‛“”„‟‹›⹂「」『』〝〞〟﹁﹂﹃﹄"'「」
where some might be questionable quotation marks.

Here's another one \p{Line_Break=Quotation}
which gives these "'«»‘’‛“”‟‹›❛❜❝❞❟❠⸀⸁⸂⸃⸄⸅⸆⸇⸈⸉⸊⸋⸌⸍⸜⸝⸠⸡🙶🙷🙸

So, be warned, there is no definitive SET according to Unicode
specs.
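One way to see why no single property fits is to print the General_Category of a few of the characters listed above. This is only a small illustration using the standard `unicodedata` module; the sample string is my own selection from the subsets shown, not a complete set.

```python
import unicodedata

# A few quote-like characters taken from the subsets listed above:
# " ' « » ‘ ’ ‚ ‛ “ ” „ ‟ ‹ ›
samples = "\"'\u00AB\u00BB\u2018\u2019\u201A\u201B\u201C\u201D\u201E\u201F\u2039\u203A"

# category() returns the two-letter General_Category, e.g. 'Pi', 'Pf', 'Ps', 'Po'.
categories = {ch: unicodedata.category(ch) for ch in samples}

for ch, cat in categories.items():
    print(f"U+{ord(ch):04X} {cat} {unicodedata.name(ch)}")
```

The ASCII quotes come out as Po (other punctuation), the curly and angle quotes as Pi or Pf (initial or final quote punctuation), and the low-9 marks as Ps (open punctuation), so a class such as `[\p{Pi}\p{Pf}]` would miss several of them.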


For the hyphen, an explicit character class roughly equivalent to \p{Pd} would probably be

find    (?:[\u002D\u058A\u05BE\u1400\u1806\u2010-\u2015\u2E17\u2E1A\u2E3A\u2E3B\u2E40\u301C\u3030\u30A0\uFE31\uFE32\uFE58\uFE63\uFF0D]|\uD803\uDEAD)
replace -  

And for the single quote

find:   [\u0060\u00B4\u2018\u2019]
replace '

And for double quote

find    [\u201C\u201D]
replace "
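In Python, the three find/replace pairs above could be sketched with the built-in `re` module as follows. The function name `standardize` is made up for this example. Note that Python 3 strings hold code points rather than UTF-16 units, so the surrogate pair `\uD803\uDEAD` from the hyphen pattern is written as `\U00010EAD` here.

```python
import re

# Dash class from the pattern above; \U00010EAD replaces the surrogate pair.
DASHES = re.compile(
    r"[\u002D\u058A\u05BE\u1400\u1806\u2010-\u2015\u2E17\u2E1A\u2E3A\u2E3B"
    r"\u2E40\u301C\u3030\u30A0\uFE31\uFE32\uFE58\uFE63\uFF0D\U00010EAD]"
)
# Grave accent, acute accent, left/right single quotation marks.
SINGLE_QUOTES = re.compile(r"[\u0060\u00B4\u2018\u2019]")
# Left/right double quotation marks.
DOUBLE_QUOTES = re.compile(r"[\u201C\u201D]")


def standardize(text):
    """Map dashes to '-', single quotes to "'" and double quotes to '"'."""
    text = DASHES.sub("-", text)
    text = SINGLE_QUOTES.sub("'", text)
    return DOUBLE_QUOTES.sub('"', text)


print(standardize("\u201CIt\u2019s\u201D \u2014 ok"))
```

These classes are deliberately narrow; widen them only with characters you actually want folded into the ASCII forms.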

Also note that every character is matched by many Unicode properties,
so when you traverse a sample string you can see overlapping
property relationships.

Upvotes: 1

Wiktor Stribiżew

Reputation: 626896

If you use the Uniview tool, you can search for all Unicode characters whose descriptions contain a reference to "single quotation mark", "double quotation mark", or "apostrophe".
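A similar search can be approximated in Python with the standard `unicodedata` module. This is only a rough sketch: the helper `find_by_name` is a name invented here, and it matches formal character names only, whereas Uniview may also search other descriptive fields, so the two result sets will not be identical.

```python
import sys
import unicodedata


def find_by_name(*needles):
    """Return (code point, name) pairs whose Unicode name contains any needle."""
    hits = []
    for cp in range(sys.maxunicode + 1):
        # The default "" is returned for code points that have no name.
        name = unicodedata.name(chr(cp), "")
        if any(needle in name for needle in needles):
            hits.append((f"U+{cp:04X}", name))
    return hits


hits = find_by_name("QUOTATION MARK", "APOSTROPHE")
for code, name in hits:
    print(code, name)
```

A name search like this casts a wide net (it also returns ornaments and script-specific letters), so the results still need pruning by hand.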

Here are somewhat pruned outputs:

Single quotation marks, [\u02BB\u02BC\u066C\u2018-\u201A\u275B\u275C] (see demo):

  • ʻ - 02BB MODIFIER LETTER TURNED COMMA
  • ʼ - 02BC MODIFIER LETTER APOSTROPHE
  • ٬ - 066C ARABIC THOUSANDS SEPARATOR
  • ‘ - 2018 LEFT SINGLE QUOTATION MARK
  • ’ - 2019 RIGHT SINGLE QUOTATION MARK
  • ‚ - 201A SINGLE LOW-9 QUOTATION MARK
  • ❛ - 275B HEAVY SINGLE TURNED COMMA QUOTATION MARK ORNAMENT
  • ❜ - 275C HEAVY SINGLE COMMA QUOTATION MARK ORNAMENT

Double quotation marks, [\u201C-\u201E\u2033\u275D\u275E\u301D\u301E] (see demo):

  • “ - 201C LEFT DOUBLE QUOTATION MARK
  • ” - 201D RIGHT DOUBLE QUOTATION MARK
  • „ - 201E DOUBLE LOW-9 QUOTATION MARK
  • ″ - 2033 DOUBLE PRIME
  • ❝ - 275D HEAVY DOUBLE TURNED COMMA QUOTATION MARK ORNAMENT
  • ❞ - 275E HEAVY DOUBLE COMMA QUOTATION MARK ORNAMENT
  • 〝 - 301D REVERSED DOUBLE PRIME QUOTATION MARK
  • 〞 - 301E DOUBLE PRIME QUOTATION MARK

Apostrophes, [\u0027\u02B9\u02BB\u02BC\u02BE\u02C8\u02EE\u0301\u0313\u0315\u055A\u05F3\u07F4\u07F5\u1FBF\u2018\u2019\u2032\uA78C\uFF07] (see demo):

  • ' - 0027 APOSTROPHE
  • ʹ - 02B9 MODIFIER LETTER PRIME
  • ʻ - 02BB MODIFIER LETTER TURNED COMMA
  • ʼ - 02BC MODIFIER LETTER APOSTROPHE
  • ʾ - 02BE MODIFIER LETTER RIGHT HALF RING
  • ˈ - 02C8 MODIFIER LETTER VERTICAL LINE
  • ˮ - 02EE MODIFIER LETTER DOUBLE APOSTROPHE
  • ◌́ - 0301 COMBINING ACUTE ACCENT
  • ◌̓ - 0313 COMBINING COMMA ABOVE
  • ◌̕ - 0315 COMBINING COMMA ABOVE RIGHT
  • ՚ - 055A ARMENIAN APOSTROPHE
  • ׳ - 05F3 HEBREW PUNCTUATION GERESH
  • ߴ - 07F4 NKO HIGH TONE APOSTROPHE
  • ߵ - 07F5 NKO LOW TONE APOSTROPHE
  • ᾿ - 1FBF GREEK PSILI
  • ‘ - 2018 LEFT SINGLE QUOTATION MARK
  • ’ - 2019 RIGHT SINGLE QUOTATION MARK
  • ′ - 2032 PRIME
  • ꞌ - A78C LATIN SMALL LETTER SALTILLO
  • ＇ - FF07 FULLWIDTH APOSTROPHE
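The lists above can be turned into a translation table. The following is a minimal sketch, not a definitive mapping: `normalize_quotes` is a name made up for this example, single quotes and apostrophes are both folded into `'`, and the combining marks U+0301, U+0313 and U+0315 are left out on purpose, because replacing them would damage decomposed accented letters such as "e" followed by U+0301.

```python
# Characters copied from the pruned lists above.
SINGLE_QUOTES = "\u02BB\u02BC\u066C\u2018\u2019\u201A\u275B\u275C"
DOUBLE_QUOTES = "\u201C\u201D\u201E\u2033\u275D\u275E\u301D\u301E"
# Apostrophe list without U+0027 (already the target) and without the
# combining marks U+0301, U+0313, U+0315 (an assumption of this sketch).
APOSTROPHES = (
    "\u02B9\u02BB\u02BC\u02BE\u02C8\u02EE\u055A\u05F3\u07F4\u07F5"
    "\u1FBF\u2018\u2019\u2032\uA78C\uFF07"
)

# str.maketrans accepts a dict mapping single characters to replacements.
QUOTE_TABLE = str.maketrans({
    **dict.fromkeys(SINGLE_QUOTES + APOSTROPHES, "'"),
    **dict.fromkeys(DOUBLE_QUOTES, '"'),
})


def normalize_quotes(text):
    """Replace the listed quote and apostrophe variants with ASCII ' and "."""
    return text.translate(QUOTE_TABLE)


print(normalize_quotes("\u201EHello\u201C, it\u2019s \u275Bfine\u275C"))
```

`str.translate` does the whole replacement in one pass, which is convenient when the mapping is character-to-character. Some entries in the lists (for example the Arabic thousands separator or the modifier letters) have other legitimate uses, so prune the table for your own data before applying it.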

Upvotes: 6
